# Importing the dataset
dataset = read.csv("./data/Salary_Data.csv")
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling - Not needed - the model takes care of this for us
# (Always check the model's documentation to determine whether scaling is handled automatically.)
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Simple Linear Regression to the Training set
regressor = lm(formula = Salary ~ YearsExperience,
data = training_set)
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)
y_pred
## 2 4 5 8 11 16 20
## 37766.77 44322.33 46195.35 55560.43 62115.99 71481.07 81782.66
## 21 24 26
## 89274.72 102385.84 109877.90
# Visualising the Training set results
library(ggplot2)
ggplot() +
geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Training set)') +
xlab('Years of experience') +
ylab('Salary')
# Visualising the Test set results
library(ggplot2)
ggplot() +
geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary), colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)), colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') + xlab('Years of experience') + ylab('Salary')
For linear regression, the following must be true:
Sample size: a common rule of thumb is that regression analysis requires at least 20 cases per independent variable in the analysis.
Linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects. The linearity assumption is best tested with scatter plots.
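As a minimal sketch (with simulated data standing in for a real dataset), a scatter plot with the fitted line overlaid makes the linearity check concrete:

```r
# Simulated example: plot predictor vs response and overlay the fitted line;
# points forming a roughly straight band suggest the linearity assumption holds
set.seed(1)
x <- runif(50, 1, 10)
y <- 2.5 * x + rnorm(50)
plot(x, y, main = 'Linearity check', xlab = 'x', ylab = 'y')
abline(lm(y ~ x), col = 'blue')
```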
Linear regression analysis requires all variables to be multivariate normal. This assumption is best checked with a histogram and a fitted normal curve, or with a Q-Q plot.
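For example (simulated data here; substitute your own fitted model), a histogram and a Q-Q plot of the residuals give a quick visual check, and the Shapiro-Wilk test a formal one:

```r
# Simulated example: inspect residual normality with a histogram and Q-Q plot
set.seed(42)
x <- 1:50
y <- 2 * x + rnorm(50)
model <- lm(y ~ x)
hist(residuals(model), main = 'Residuals', xlab = 'Residual')
qqnorm(residuals(model)); qqline(residuals(model))
shapiro.test(residuals(model))  # a large p-value gives no evidence against normality
```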
Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are not independent of each other. A second important independence assumption is that the residuals must be independent of the independent variables.
Multicollinearity is commonly assessed with four criteria: the correlation matrix, tolerance, the variance inflation factor (VIF), and the condition index.
If multicollinearity is found in the data, centering the data (that is, subtracting the mean score) may help solve the problem. Another alternative is to conduct a factor analysis and rotate the factors to ensure the independence of the factors used in the linear regression analysis.
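A minimal sketch of the VIF check using the car package's vif() (simulated data, with x2 deliberately built to be nearly collinear with x1):

```r
# install.packages('car')
library(car)
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # x2 is nearly collinear with x1
y  <- x1 + rnorm(n)
vif(lm(y ~ x1 + x2))           # VIFs far above 10 flag serious multicollinearity
```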
Linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other; in other words, when the value of y(x+1) is not independent of the value of y(x).
While a scatter plot allows you to check for autocorrelation visually, you can test the linear regression model formally with the Durbin-Watson test. Durbin-Watson's d tests the null hypothesis that the residuals are not linearly autocorrelated. d can take values between 0 and 4; values around 2 indicate no autocorrelation, and as a rule of thumb 1.5 < d < 2.5 shows that there is no cause for concern. Note, however, that the Durbin-Watson test only detects linear autocorrelation, and only between direct neighbours (first-order effects).
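A sketch of the test using dwtest() from the lmtest package (simulated data with independent errors, so d should come out near 2):

```r
# install.packages('lmtest')
library(lmtest)
set.seed(123)
x <- 1:100
y <- 3 * x + rnorm(100)  # independent errors
dwtest(lm(y ~ x))        # d near 2: no evidence of first-order autocorrelation
```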
The last assumption of linear regression analysis is homoscedasticity: the error term (that is, the "noise" or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables. A scatter plot of residuals against fitted values is a good way to check this; if the data is heteroscedastic, the points fan out rather than forming a band of constant width.
Recall that ordinary least-squares (OLS) regression seeks to minimize residuals and in turn produce the smallest possible standard errors. By definition, OLS regression gives equal weight to all observations, but when heteroscedasticity is present, the cases with larger disturbances have more “pull” than other observations. In this case, weighted least squares regression would be more appropriate, as it down-weights those observations with larger disturbances.
A more serious problem associated with heteroscedasticity is the fact that the standard errors are biased. Because the standard error is central to conducting significance tests and calculating confidence intervals, biased standard errors lead to incorrect conclusions about the significance of the regression coefficients. Many statistical programs provide an option of robust standard errors to correct this bias. Weighted least squares regression also addresses this concern but requires a number of additional assumptions. Another approach for dealing with heteroscedasticity is to transform the dependent variable using one of the variance stabilizing transformations. A logarithmic transformation can be applied to highly skewed variables, while count variables can be transformed using a square root transformation. Overall however, the violation of the homoscedasticity assumption must be quite severe in order to present a major problem given the robust nature of OLS regression.
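As a hedged sketch, a residuals-vs-fitted plot plus the Breusch-Pagan test (bptest() from lmtest) makes the check concrete; the data here are simulated so that the error variance grows with x:

```r
library(lmtest)
set.seed(123)
x <- runif(100, 1, 10)
y <- 2 * x + rnorm(100, sd = x)        # variance grows with x: heteroscedastic
model <- lm(y ~ x)
plot(fitted(model), residuals(model))  # fan shape signals heteroscedasticity
bptest(model)                          # a small p-value rejects homoscedasticity
```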
Remember, for linear regression you need to transform categorical variables into dummy variables.
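In R, lm() builds the dummy variables from factors automatically; model.matrix() shows the encoding (illustrated here with the three State levels used below):

```r
# The first factor level becomes the baseline; the others get 0/1 dummy columns
df <- data.frame(State = factor(c('New York', 'California', 'Florida')))
model.matrix(~ State, data = df)
```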
# Importing the dataset
dataset = read.csv("./data/50_Startups.csv")
# Encoding categorical data
dataset$State = factor(dataset$State, levels = c('New York', 'California', 'Florida'), labels = c(1, 2, 3))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Multiple Linear Regression to the Training set
regressor = lm(formula = Profit ~ ., data = training_set)
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)
# Building the optimal model using Backward Elimination
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State, data = dataset)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend +
## State, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33504 -4736 90 6672 17338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.008e+04 6.953e+03 7.204 5.76e-09 ***
## R.D.Spend 8.060e-01 4.641e-02 17.369 < 2e-16 ***
## Administration -2.700e-02 5.223e-02 -0.517 0.608
## Marketing.Spend 2.698e-02 1.714e-02 1.574 0.123
## State2 4.189e+01 3.256e+03 0.013 0.990
## State3 2.407e+02 3.339e+03 0.072 0.943
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
## F-statistic: 169.9 on 5 and 44 DF, p-value: < 2.2e-16
# Optional Step: Remove State2 only (as opposed to removing State directly)
# regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + factor(State, exclude = 2),
# data = dataset)
# summary(regressor)
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, data = dataset)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
## data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33534 -4795 63 6606 17275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.012e+04 6.572e+03 7.626 1.06e-09 ***
## R.D.Spend 8.057e-01 4.515e-02 17.846 < 2e-16 ***
## Administration -2.682e-02 5.103e-02 -0.526 0.602
## Marketing.Spend 2.723e-02 1.645e-02 1.655 0.105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared: 0.9507, Adjusted R-squared: 0.9475
## F-statistic: 296 on 3 and 46 DF, p-value: < 2.2e-16
regressor = lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = dataset)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
## R.D.Spend 7.966e-01 4.135e-02 19.266 <2e-16 ***
## Marketing.Spend 2.991e-02 1.552e-02 1.927 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
regressor = lm(formula = Profit ~ R.D.Spend, data = dataset)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34351 -4626 -375 6249 17188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.903e+04 2.538e+03 19.32 <2e-16 ***
## R.D.Spend 8.543e-01 2.931e-02 29.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9416 on 48 degrees of freedom
## Multiple R-squared: 0.9465, Adjusted R-squared: 0.9454
## F-statistic: 849.8 on 1 and 48 DF, p-value: < 2.2e-16
# Importing the dataset
dataset = read.csv("./data/50_Startups.csv")
# Encoding categorical data
dataset$State = factor(dataset$State,
levels = c('New York', 'California', 'Florida'), labels = c(1, 2, 3))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Multiple Linear Regression to the Training set
regressor = lm(formula = Profit ~ ., data = training_set)
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)
# Importing the dataset
dataset = read.csv("./data/Position_Salaries.csv")
dataset = dataset[2:3]
# Splitting the dataset into the Training set and Test set
# # install.packages('caTools')
# library(caTools)
# set.seed(123)
# split = sample.split(dataset$Salary, SplitRatio = 2/3)
# training_set = subset(dataset, split == TRUE)
# test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Linear Regression to the dataset
lin_reg = lm(formula = Salary ~ ., data = dataset)
# Fitting Polynomial Regression to the dataset
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
poly_reg = lm(formula = Salary ~ ., data = dataset)
# Visualising the Linear Regression results
# install.packages('ggplot2')
library(ggplot2)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(lin_reg, newdata = dataset)), colour = 'blue') +
ggtitle('Truth or Bluff (Linear Regression)') + xlab('Level') + ylab('Salary')
# Visualising the Polynomial Regression results
# install.packages('ggplot2')
library(ggplot2)
ggplot() + geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)), colour = 'blue') +
ggtitle('Truth or Bluff (Polynomial Regression)') + xlab('Level') + ylab('Salary')
# Visualising the Regression Model results (for higher resolution and smoother curve)
# install.packages('ggplot2')
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() + geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = x_grid, y = predict(poly_reg,
newdata = data.frame(Level = x_grid, Level2 = x_grid^2, Level3 = x_grid^3, Level4 = x_grid^4))),
colour = 'blue') + ggtitle('Truth or Bluff (Polynomial Regression)') + xlab('Level') + ylab('Salary')
# Predicting a new result with Linear Regression
predict(lin_reg, data.frame(Level = 6.5))
## 1
## 330378.8
# Predicting a new result with Polynomial Regression
predict(poly_reg, data.frame(Level = 6.5, Level2 = 6.5^2, Level3 = 6.5^3, Level4 = 6.5^4))
## 1
## 158862.5
# Importing the dataset
dataset = read.csv("./data/Position_Salaries.csv")
dataset = dataset[2:3]
# Splitting the dataset into the Training set and Test set
# # install.packages('caTools')
# library(caTools)
# set.seed(123)
# split = sample.split(dataset$Salary, SplitRatio = 2/3)
# training_set = subset(dataset, split == TRUE)
# test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting SVR to the dataset
# install.packages('e1071')
library(e1071)
regressor = svm(formula = Salary ~ ., data = dataset, type = 'eps-regression', kernel = 'radial')
# Predicting a new result
y_pred = predict(regressor, data.frame(Level = 6.5))
# Visualising the SVR results
# install.packages('ggplot2')
library(ggplot2)
ggplot() + geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(regressor, newdata = dataset)), colour = 'blue') +
ggtitle('Truth or Bluff (SVR)') + xlab('Level') + ylab('Salary')
# Visualising the SVR results (for higher resolution and smoother curve)
# install.packages('ggplot2')
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() + geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))), colour = 'blue') +
ggtitle('Truth or Bluff (SVR)') + xlab('Level') + ylab('Salary')
Remember, decision trees are non-continuous and non-linear.
# Importing the dataset
dataset = read.csv("./data/Position_Salaries.csv")
dataset = dataset[2:3]
# Splitting the dataset into the Training set and Test set
# Not enough data to split - use all of it
# Feature Scaling - Do not need to scale because not based on Euclidean distances
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Decision Tree Regression to the dataset
# install.packages('rpart')
library(rpart)
regressor1 = rpart(formula = Salary ~ ., data = dataset)
# See what happens with minsplit below.
regressor2 = rpart(formula = Salary ~ ., data = dataset, control = rpart.control(minsplit = 1))
# Predicting a new result with Decision Tree Regression
y_pred = predict(regressor2, data.frame(Level = 6.5))
# Visualising the Decision Tree Regression results (higher resolution)
# This only works in 2D
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() + geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = x_grid, y = predict(regressor1, newdata = data.frame(Level = x_grid))), colour = 'blue') +
ggtitle('Decision Tree Regression - without minsplit') + xlab('Level') + ylab('Salary')
ggplot() + geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = x_grid, y = predict(regressor2, newdata = data.frame(Level = x_grid))), colour = 'blue') +
ggtitle('Decision Tree Regression - with minsplit') + xlab('Level') + ylab('Salary')
minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.
# Plotting the tree
plot(regressor2)
text(regressor2)
# Importing the dataset
dataset = read.csv("./data/Position_Salaries.csv")
dataset = dataset[2:3]
# Splitting the dataset into the Training set and Test set
# Not needed
# Feature Scaling Not needed
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Random Forest Regression to the dataset
# install.packages('randomForest')
library(randomForest)
set.seed(1234)
regressor = randomForest(x = dataset[-2], y = dataset$Salary, ntree = 500)
# Predicting a new result with Random Forest Regression
y_pred = predict(regressor, data.frame(Level = 6.5))
# Visualising the Random Forest Regression results (higher resolution)
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() + geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))), colour = 'blue') +
ggtitle('Truth or Bluff (Random Forest Regression)') + xlab('Level') + ylab('Salary')
Notice that the model in the lower left performs best; compare the adjusted R^2 values.
Always evaluate coefficients carefully. Holding all else constant, each additional dollar of R.D.Spend increases the predicted Profit by about $0.7966, i.e. 79.66 cents (Profit and R.D.Spend are both expressed in dollars).
Regularization is a technique used to combat overfitting in statistical models.
The regularization parameter lambda and the penalty parameter C are hyperparameters; you can tune them to optimize the models. Grid search helps find the optimal values, and in R the caret package provides this kind of tuning.
L1 (Lasso) regularization helps perform feature selection in sparse feature spaces, and that is a good practical reason to use L1 in some situations. However, beyond that particular reason I have never seen L1 perform better than L2 (Ridge) in practice. Even in a situation where you might benefit from L1's sparsity for feature selection, using L2 on the remaining variables is likely to give better results than L1 by itself.
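A minimal sketch with the glmnet package (simulated data; alpha = 1 gives Lasso, alpha = 0 gives Ridge, and cv.glmnet() picks lambda by cross-validation):

```r
# install.packages('glmnet')
library(glmnet)
set.seed(123)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)  # only the first two predictors matter
cv <- cv.glmnet(X, y, alpha = 1)       # Lasso: L1 penalty
coef(cv, s = 'lambda.min')             # irrelevant coefficients shrink toward zero
```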
Logistic Regression is a linear model as the plots will show.
We can apply a sigmoid function to map a linear combination of the inputs to a probability between 0 and 1. Because logistic regression is binary, that probability is then thresholded (typically at 0.5) to produce a 0/1 prediction.
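A minimal sketch of the sigmoid and the 0.5 threshold:

```r
# The sigmoid squashes any real number into (0, 1), interpreted as a probability
sigmoid <- function(z) 1 / (1 + exp(-z))
p <- sigmoid(c(-2, 0, 2))  # approx. 0.119, 0.500, 0.881
ifelse(p > 0.5, 1, 0)      # thresholding at 0.5 gives 0 0 1
```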
# Importing the dataset
dataset = read.csv("./data/Social_Network_Ads.csv")
dataset = dataset[3:5] # Only using Age and EstimatedSalary as predictors
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling - used in classification
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting Logistic Regression to the Training set
classifier = glm(formula = Purchased ~ ., family = binomial, data = training_set)
# Predicting the Test set results
# type = 'response' gives the predicted probabilities in a single vector
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
y_pred[1:10]
## 2 4 5 9 12 18 19 20 22 29
## 0 0 0 0 0 0 0 0 1 0
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
cm
## y_pred
## 0 1
## 0 57 7
## 1 10 26
# Visualizing the Training set results
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)# expand.grid creates a data frame from all combinations of the supplied vectors/factors
colnames(grid_set) = c('Age', 'EstimatedSalary')
prob_set = predict(classifier, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1, 0)
plot(set[, -3],
main = 'Logistic Regression (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualizing the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
prob_set = predict(classifier, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1, 0)
plot(set[, -3],
main = 'Logistic Regression (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
k-NN is a non-linear model.
# Importing the dataset
dataset = read.csv("./data/Social_Network_Ads.csv")
dataset = dataset[3:5]
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting K-NN to the Training set and Predicting the Test set results
library(class)
y_pred = knn(train = training_set[, -3], test = test_set[, -3], cl = training_set[, 3], k = 5, prob = TRUE)
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
# Visualising the Training set results
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
# y_grid is computed with knn() directly, as y_pred was above - reuse that call
# with the same training data but test = grid_set
y_grid = knn(train = training_set[, -3], test = grid_set, cl = training_set[, 3], k = 5)
plot(set[, -3],
main = 'K-NN (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = knn(train = training_set[, -3], test = grid_set, cl = training_set[, 3], k = 5)
plot(set[, -3],
main = 'K-NN (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Importing the dataset
dataset = read.csv("./data/Social_Network_Ads.csv")
dataset = dataset[3:5]
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting SVM to the Training set
# install.packages('e1071')
library(e1071)
classifier = svm(formula = Purchased ~ ., data = training_set, type = 'C-classification', kernel = 'linear')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3])
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
# Visualising the Training set results
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
main = 'SVM (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'SVM (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
The data is not linearly separable.
The classifier maps the data to a higher-dimensional space, turning a non-separable one-dimensional set into a separable dataset.
Mapping to a higher-dimensional space can be computationally intensive. The kernel trick helps avoid that cost; it is not explored in this document.
There are different kernel functions, such as the Gaussian (RBF), polynomial, and sigmoid kernels.
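A sketch of the idea in one dimension: the classes below cannot be split by a single threshold on x, but the mapping phi(x) = (x, x^2) makes them linearly separable:

```r
x <- c(-2, -1, 0, 1, 2)
y <- c(1, 0, 0, 0, 1)  # class 1 sits at both extremes: no single cut on x works
phi <- cbind(x, x^2)   # lift into 2-D
# In the lifted space, any horizontal line between x^2 = 1 and x^2 = 4
# separates the classes
min(phi[y == 1, 2]) > max(phi[y == 0, 2])  # TRUE
```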
# Importing the dataset
dataset = read.csv("./data/Social_Network_Ads.csv")
dataset = dataset[3:5]
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting Kernel SVM to the Training set
# install.packages('e1071')
library(e1071)
classifier = svm(formula = Purchased ~ ., data = training_set, type = 'C-classification', kernel = 'radial')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3])
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
# Visualising the Training set results
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
main = 'Kernel SVM (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'Kernel SVM (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
Here is the raw definition:
The Naive Bayes algorithm is a classification algorithm based on Bayes' theorem and can be used for both exploratory and predictive modeling. The word naive in the name derives from the fact that the algorithm uses Bayesian techniques but does not take into account dependencies that may exist between the features.
Bayes' theorem computes probabilities from information we already have; the classic illustration involves wrenches produced by two machines.
It is actually very intuitive when you count cases, for example by assuming 1000 wrenches were manufactured.
The key requirement for a Naive Bayes model is that the features can be treated as conditionally independent given the class.
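A hedged worked version of the wrench example (all numbers assumed for illustration): suppose that of 1000 wrenches, machine 2 makes 400, 1% of all wrenches are defective, and half of the defective ones come from machine 2.

```r
p_m2     <- 400 / 1000               # P(Machine 2)
p_def    <- 0.01                     # P(Defective)
p_m2_def <- 0.5                      # P(Machine 2 | Defective)
# Bayes' theorem: P(Defective | Machine 2)
p_def_m2 <- p_m2_def * p_def / p_m2
p_def_m2                             # 0.0125, i.e. 5 defective out of the 400
```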
# Importing the dataset
dataset = read.csv("./data/Social_Network_Ads.csv")
dataset = dataset[3:5]
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting Naive Bayes to the Training set
# install.packages('e1071')
library(e1071)
classifier = naiveBayes(x = training_set[-3], y = training_set$Purchased)
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3])
y_pred[1:10]
## [1] 0 0 0 0 0 0 0 1 0 0
## Levels: 0 1
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
cm
## y_pred
## 0 1
## 0 57 7
## 1 7 29
# Visualising the Training set results
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
main = 'Naive Bayes (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'Naive Bayes (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
CART - Classification and Regression Trees: classification trees work with categorical targets, while regression trees predict real-valued outputs.
Decision trees are an old method and not very popular on their own today; they have largely been reborn as the building blocks of Random Forests and Gradient Boosting.
# Importing the dataset
dataset = read.csv("./data/Social_Network_Ads.csv")
dataset = dataset[3:5]
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting Decision Tree Classification to the Training set
# install.packages('rpart')
library(rpart)
classifier = rpart(formula = Purchased ~ ., data = training_set)
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
# Visualising the Training set results
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class') # type = 'class' transforms the results from probabilities to class labels (0/1)
# y_grid must be a vector
plot(set[, -3],
main = 'Decision Tree Classification (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3], main = 'Decision Tree Classification (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Plotting the tree
plot(classifier)
text(classifier)
Random Forest is a type of ensemble learning: multiple machine learning models are combined into one better-performing algorithm. The constituent models can be of the same type or of different types.
# Importing the dataset
dataset = read.csv("./data/Social_Network_Ads.csv")
dataset = dataset[3:5]
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting Random Forest Classification to the Training set
# install.packages('randomForest')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-3], y = training_set$Purchased, ntree = 500)
# Note: y - is a response vector. If a factor, classification is assumed, otherwise regression is assumed.
# If omitted, randomForest will run in unsupervised mode.
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3])
y_pred[1:10]
## 2 4 5 9 12 18 19 20 22 29
## 0 0 0 0 0 1 1 1 0 0
## Levels: 0 1
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
cm
## y_pred
## 0 1
## 0 54 10
## 1 5 31
# Visualising the Training set results
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, grid_set)
plot(set[, -3],
main = 'Random Forest Classification (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, grid_set)
plot(set[, -3], main = 'Random Forest Classification (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
There is evidence of overfitting in the plots above.
# Choosing the number of trees
plot(classifier)
False Negatives (Type II errors) can be worse than False Positives (Type I errors): predicting that an earthquake will not happen when it does happen is much worse than the converse.
Look at the plot below. Since the accuracy is 98%, we could tell the model to stop making predictions altogether and simply always assume the event will never occur (always predict 0). Then the confusion matrix looks like this:
The accuracy rate went up - which makes no sense. This is why you should never rely on accuracy alone. The way to avoid this is explained below.
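A minimal numeric sketch of this accuracy paradox (all counts below are invented for illustration): on imbalanced data, always predicting 0 can score a higher accuracy than a genuine classifier.

```r
# Hypothetical imbalanced test set: 9,800 negatives, 200 positives
actual <- c(rep(0, 9800), rep(1, 200))

# A genuine classifier: 150 false positives, 100 false negatives
pred_model <- actual
pred_model[1:150] <- 1
pred_model[9801:9900] <- 0

# 'Model' that always predicts the event will never occur
pred_zero <- rep(0, length(actual))

accuracy <- function(actual, pred) mean(actual == pred)
accuracy(actual, pred_model)  # 0.975
accuracy(actual, pred_zero)   # 0.98 - higher accuracy, yet completely useless
```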
As an example, we know that 10% of customers will respond to an email campaign. If we build a model that identifies the users most likely to respond and contact them first, we capture responders faster than random selection, increasing the area under the CAP curve.
CAP != ROC
Rule of thumb (this is a personal intuition):
If Too Good - overfitting!
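Here is a minimal CAP-curve sketch on simulated data (the scores and responses below are invented for illustration): rank customers by model score, contact them in that order, and plot the cumulative fraction of responders captured against the fraction of customers contacted. The further the curve rises above the diagonal (random selection), the better the model.

```r
set.seed(1)
n <- 1000
score <- runif(n)                              # hypothetical model scores
responded <- rbinom(n, 1, prob = 0.2 * score)  # ~10% respond overall, correlated with score

ord <- order(score, decreasing = TRUE)         # contact best-scored customers first
frac_contacted <- seq_len(n) / n
cum_captured <- cumsum(responded[ord]) / sum(responded)

plot(frac_contacted, cum_captured, type = "l", col = "blue",
     xlab = "Fraction of customers contacted",
     ylab = "Fraction of responders captured",
     main = "CAP curve (sketch)")
abline(0, 1, lty = 2)                          # random-selection baseline
```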
How do I know which model to choose for my problem?
Clustering is similar to classification, but the basis is different. In Clustering you don’t know what you are looking for, and you are trying to identify some segments or clusters in your data. When you use clustering algorithms on your dataset, unexpected things can suddenly pop up like structures, clusters and groupings you would have never thought of otherwise.
Let's work through an example to understand how this works:
The last step iterates until the model cannot be improved. It might look like this:
If the initial centroids are way off, problems can arise.
This is the intuitive result:
But if the initial centroids are picked poorly, different results might occur:
There is an extension of the k-Means algorithm that avoids this problem: k-Means++. Implementations typically apply this seeding in the background, so the problem is easy to avoid.
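For intuition, here is a minimal sketch of k-means++ style seeding (an illustration, not an optimized implementation): the first centroid is chosen uniformly at random, and each subsequent centroid is sampled with probability proportional to the squared distance to the nearest centroid already chosen. Base R's kmeans() does not do this itself - it restarts from random centers via the nstart argument - but it accepts seeded centers:

```r
# Pick k initial centers, k-means++ style (illustration only)
kmeanspp_centers <- function(x, k) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), 1), , drop = FALSE]  # first center: uniform at random
  while (nrow(centers) < k) {
    # squared distance from every point to its nearest chosen center
    d2 <- apply(x, 1, function(p) min(colSums((t(centers) - p)^2)))
    # next center: sampled with probability proportional to d2
    centers <- rbind(centers, x[sample(nrow(x), 1, prob = d2), , drop = FALSE])
  }
  centers
}

set.seed(123)
init <- kmeanspp_centers(iris[, 1:2], k = 3)
km <- kmeans(iris[, 1:2], centers = init)  # hand the seeded centers to base kmeans()
```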
WCSS - Within-Cluster Sum of Squares. Useful for comparing the goodness of fit of two k-Means clusterings. Let's look at an example:
You can have as many clusters as there are points: WCSS = 0 when each point is its own centroid, and WCSS decreases monotonically as the number of clusters grows. However, diminishing returns suggest the optimal k value (this is called the Elbow Method):
The elbow is not always so obvious. Use your judgement.
# Importing the dataset
dataset = read.csv("./data/Mall_Customers.csv")
dataset = dataset[4:5]
# Using the elbow method to find the optimal number of clusters
set.seed(6)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(dataset, i)$withinss)
plot(1:10, wcss, type = 'b', main = paste('The Elbow Method'), xlab = 'Number of clusters', ylab = 'WCSS')
# Fitting K-Means to the dataset
set.seed(29)
kmeans = kmeans(x = dataset, centers = 5)
y_kmeans = kmeans$cluster
# Visualising the clusters
library(cluster)
clusplot(dataset, y_kmeans, lines = 0, shade = TRUE, color = TRUE, labels = 2, plotchar = FALSE,
span = TRUE, main = paste('Clusters of customers'), xlab = 'Annual Income', ylab = 'Spending Score')
library(ggplot2)
library(reshape)
# Case study 1: the basics of the K-means in R
set.seed(1121)
## Set three vectors, each with 100 entries
l <- 100
## x with a mean of 1 and a sd of 0.5
x <- rnorm(l,1,0.5)
mean(x);var(x);sd(x);summary(x)
## [1] 0.9746369
## [1] 0.2073973
## [1] 0.4554089
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.02457 0.70670 1.03000 0.97460 1.20700 2.00000
## y with a mean of 5 and a sd of 1
y <- rnorm(l,5,1)
mean(y);var(y);sd(y);summary(y)
## [1] 5.097115
## [1] 1.056125
## [1] 1.027679
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.803 4.379 5.167 5.097 5.747 8.290
## z with a mean of 4 and a sd of 0.4
z <- rnorm(l,4,0.4)
mean(z);var(z);sd(z);summary(z)
## [1] 4.000337
## [1] 0.1765497
## [1] 0.4201781
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.904 3.707 3.970 4.000 4.288 5.174
# set dataset
dataset <- data.frame(x=x,y=y,z=z)
dataset_m <- melt(dataset)
dataset_m <- cbind(id=1:(l*3),dataset_m)
colnames(dataset_m) <- c("id","class","value")
# Now plot the distribution of the artificial dataset in order to check out the
# distribution of the entries in the dataset:
ggplot(dataset_m) + aes(x=value) +
geom_density(aes(fill="red",colour="red"),alpha=0.4) +
theme(legend.position="none") + labs(title="Density plot of all values (omitting classes)")
ggplot(dataset_m) + aes(x=value) +
geom_density(aes(fill=class,colour=class),alpha=0.4) + labs(title="Density plot for all classes")
# Scatter plot of dataset
ggplot(dataset_m) + aes(x=id,y=value) +
geom_point(aes(shape=class),size=2.5,alpha=0.9) + labs(title="Scatter plot of dataset")
## set the number of centroids with parameter k
k <- 2
## run the Kmeans algorithm
km <- kmeans(dataset_m$value,centers=k)
dataset_km <- cbind(dataset_m,cluster=as.factor(km$cluster))
# plot each sample
ggplot(dataset_km) + aes(x=id,y=value) +
geom_point(aes(shape=class,colour=cluster),size=2.5,alpha=0.9) + labs(title="Dataset K-means clustering | k=2")
## set the number of centroids with parameter k
k <- 3
## run the Kmeans algorithm
km <- kmeans(dataset_m$value,centers=k)
dataset_km <- cbind(dataset_m,cluster=as.factor(km$cluster))
# plot each sample
ggplot(dataset_km) + aes(x=id,y=value) +
geom_point(aes(shape=class,colour=cluster),size=2.5,alpha=0.9) + labs(title="Dataset K-means clustering | k=3")
## set the number of centroids with parameter k
k <- 4
## run the Kmeans algorithm
km <- kmeans(dataset_m$value,centers=k)
dataset_km <- cbind(dataset_m,cluster=as.factor(km$cluster))
# plot each sample
ggplot(dataset_km) + aes(x=id,y=value) +
geom_point(aes(shape=class,colour=cluster),size=2.5,alpha=0.9) +
labs(title="Dataset K-means clustering | k=4")
# COMPUTE THE BEST VALUE FOR PARAMETER K
# TRAIN THE MODEL ITERATING THE VALUES OF K
max_k_size <- 10
error <- numeric(max_k_size)
for (i in 1:max_k_size){error[i] <- sum(kmeans(dataset,centers=i)$withinss)}
# set data frame object
error <- data.frame(k=1:max_k_size,error=error)
ggplot(error) + aes(x=k,y=error) + geom_point(colour="#f96161",size=3) +
ylim(0,150) + scale_x_continuous(breaks=seq(0, 15, 1)) +
geom_line(colour="#f96161") +
labs(title="Error calibration curve",
x="Number of Clusters (k)",
y="Within groups sum of squares") +
geom_vline(xintercept=3,colour="#32ab9f", linetype = "longdash")
Hierarchical Clustering often produces results very similar to k-Means; the way you arrive at those results, however, is different.
There are two types: agglomerative (bottom-up) and divisive (top-down).
Note that we are talking about the proximity of clusters, not points. This is a crucial element to consider in your approach.
There are different ways to measure the distance between clusters (the distance between centroids is shown):
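In R the linkage choice is the method argument of hclust(); here is a quick sketch comparing several linkages on the same distance matrix (using the scaled iris features for illustration). Different linkages can produce different cluster assignments.

```r
d <- dist(scale(iris[, -5]))  # Euclidean distances on the scaled iris features

hc_single   <- hclust(d, method = "single")    # distance between the closest members
hc_complete <- hclust(d, method = "complete")  # distance between the farthest members
hc_average  <- hclust(d, method = "average")   # average of all pairwise distances
hc_ward     <- hclust(d, method = "ward.D2")   # merge that minimizes within-cluster variance

# Cross-tabulate two of the resulting 3-cluster solutions to see the differences
table(single = cutree(hc_single, 3), ward = cutree(hc_ward, 3))
```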
Let's understand this through an example:
The algorithm remembers how each cluster was built up; this is what a dendrogram illustrates.
Note: the height of each horizontal line represents the Euclidean distance between the clusters being joined. Example: P_2 and P_3 join at the lowest height, illustrating that those two points are the closest together of all the points.
The plot above shows we are evaluating 2 clusters (count how many times the horizontal line crosses the dendrogram). The breakpoint appears to be ~1.75.
This one has 4 clusters.
The largest vertical distance in a dendrogram that does not cross any extended horizontal line suggests the optimal number of clusters.
Above, it is not critical where the '2 Clusters' horizontal line is placed, because anywhere in that gap it still yields 2 clusters. As shown, it crosses the Euclidean distance (y-axis) at 1.75, but it could also cross at 1.0 or 2.5.
Here is another example:
# Importing the dataset
dataset = read.csv("./data/Mall_Customers.csv")
dataset = dataset[4:5]
# Using the dendrogram to find the optimal number of clusters
dendrogram = hclust(d = dist(dataset, method = 'euclidean'), method = 'ward.D') # ward.D minimizes the variance within each cluster
plot(dendrogram, main = paste('Dendrogram'), xlab = 'Customers', ylab = 'Euclidean distances')
# Fitting Hierarchical Clustering to the dataset
y_hc = cutree(dendrogram, 5)
#Cuts a tree, e.g., as resulting from hclust, into several groups either by specifying the desired number(s)
#of groups or the cut height(s).
# Visualising the clusters - this only works for 2 dimensions
library(cluster)
clusplot(dataset, y_hc, lines = 0, shade = TRUE, color = TRUE, labels= 2, plotchar = FALSE, span = TRUE,
main = paste('Clusters of customers'), xlab = 'Annual Income', ylab = 'Spending Score')
We have explored how to perform clustering but a question remains - Should we?
A big issue in cluster analysis is that clustering methods will always return clusters, even if the data does not contain any meaningful ones. If you blindly apply a clustering method to data, it will divide the data into clusters because that is what it is designed to do.
This document provides guidance on evaluating whether the data contains meaningful clusters. This is called Clustering Tendency.
To assess clustering tendency, we will use two R packages.
if(!require(easypackages)){
install.packages("easypackages")
library(easypackages)
}
packages("factoextra", "clustertend", prompt = FALSE)
To demo cluster tendency, we need one dataset that illustrates clear clustering tendency (the iris dataset) and one that is randomized. Let's get this data now.
df <- iris[,-5]
head(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
random_df <- apply(df, 2, function(x){runif(length(x), min(x), max(x))}) # apply returns a matrix here
# Recall that the 2 in apply() specifies columns
random_df <- as.data.frame(random_df)
#Normalize
df <- scale(df)
random_df <- scale(random_df)
Since our data contains more than 2 variables, we need to reduce the dimensionality to do a scatter plot. Use PCA to do this. We will then leverage factoextra to plot.
fviz_pca_ind(prcomp(df), title="PCA-Iris Data", habillage = iris$Species, palette = "jco",
geom="point", ggtheme=theme_classic())
# habillage: color the individuals among a categorical variable (give the number of the categorical
# supplementary variable or its name)
fviz_pca_ind(prcomp(random_df), title="PCA-Random Data", geom="point", ggtheme=theme_classic())
It is clear that the iris data has well-defined clusters, whereas the random data shows no such indication. Therefore, we conclude we have data we can use to demo cluster tendency.
Let's see what the output is when we perform k-means and hierarchical clustering on the random data. For visualization, I will continue to use factoextra.
Here is a k-means visualization of the random data:
kmeans1 <- kmeans(random_df, 3)
fviz_cluster(list(data = random_df, cluster=kmeans1$cluster), ellipse.type = "norm", geom="point", stand=FALSE,
palette="jco", ggtheme=theme_classic())
Here is a hierarchical visualization of the random data:
fviz_dend(hclust(dist(random_df)), k=3, k_colors = "jco", as.ggplot=TRUE, show_labels=FALSE)
As illustrated above, even when there are no clusters in the data, the algorithms will still do their job and try to find some. This can be misleading, which is why assessing cluster tendency is a required step in any cluster analysis.
There are two primary methods to assess clustering tendency: the Hopkins Statistic and Visual Assessment of cluster Tendency (VAT).
The Hopkins Statistic assesses clustering tendency by measuring the probability that a given dataset was generated by a uniform distribution - that is, it tests the spatial randomness of the data.
The Hopkins Statistic calculates a value, H. A value of H = 0.5 suggests the data is uniformly distributed. The null hypothesis is that the dataset is uniformly distributed (no meaningful clusters); the alternative is that it is not uniformly distributed (it contains meaningful clusters).
If H is close to 0, reject the null hypothesis and conclude the data is significantly clusterable.
Calculate H for the iris data:
clustertend::hopkins(df, n=nrow(df)-1)
## $H
## [1] 0.1810241
Calculate H for the random data:
clustertend::hopkins(random_df, n=nrow(random_df)-1)
## $H
## [1] 0.4882823
It is shown that the iris data is highly clusterable (H = ~0.18, significantly below the 0.50 threshold). The random data is not clusterable, since its H value (~0.49) is close to the threshold.
Using VAT, we compute the dissimilarity matrix between observations using the function dist. Next the function fviz_dist() is used to display the matrix.
fviz_dist(dist(df), show_labels = FALSE) + labs(title="Iris Data")
fviz_dist(dist(random_df), show_labels = FALSE) + labs(title="Random Data")
Red - high similarity | Blue - low similarity. The color level is proportional to the value of the dissimilarity between observations.
The matrix confirms there is cluster structure in iris but not in the random data.
There are 3 ways to evaluate the goodness of clustering results: internal cluster validation, external cluster validation, and relative cluster validation.
Recall that the goal of partitioning clustering algorithms is to split the data into clusters of objects such that:
Minimize the average distance in a cluster and maximize the distance between clusters
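Base R's kmeans() reports both quantities directly, so this goal can be checked numerically: the total sum of squares decomposes into a within-cluster part (compactness, to minimize) and a between-cluster part (separation, to maximize).

```r
set.seed(123)
km <- kmeans(scale(iris[, -5]), centers = 3, nstart = 25)

km$tot.withinss            # total within-cluster sum of squares (minimize)
km$betweenss               # between-cluster sum of squares (maximize)
km$betweenss / km$totss    # share of the total variance explained by the clustering

# Sanity check of the decomposition: totss = tot.withinss + betweenss
all.equal(km$totss, km$tot.withinss + km$betweenss)
```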
if(!require(easypackages)){
install.packages("easypackages")
library(easypackages)
}
packages("factoextra", "fpc", "NbClust", prompt = FALSE)
df <- iris[,-5]
df <- scale(df)
We have performed clustering analysis previously. Here we will use new functions that overlap with the work already completed but also provide silhouette information. This introduces eclust() (enhanced clustering, from the factoextra package), which provides several useful advantages:
Let's explore eclust().
kmeans <- eclust(df, "kmeans", k=3, nstart=25, graph = FALSE) # graph = TRUE provides a visualization, just not as nice as below
fviz_cluster(kmeans, geom="point", ellipse.type = "norm", palette = "jco", ggtheme=theme_minimal())
hc <- eclust(df, "hclust", k=3, hc_metric = "euclidean", hc_method = "ward.D2", graph = FALSE) # graph = TRUE provides a visualization, just not as nice as below
fviz_dend(hc, geom="point", show_labels = FALSE, palette = "jco", as.ggplot=TRUE)
There are two commonly used indices for assessing the goodness of clustering: the Silhouette Coefficient and the Dunn Index.
Skipping the math, the Silhouette Coefficient is interpreted as follows: values near +1 indicate the observation is well clustered, values near 0 indicate it lies between two clusters, and negative values indicate it is probably in the wrong cluster.
Now it is time to test the goodness of fit using the Silhouette Coefficient:
fviz_silhouette(kmeans, palette = "jco", ggtheme = theme_classic())
## cluster size ave.sil.width
## 1 1 50 0.64
## 2 2 47 0.35
## 3 3 53 0.39
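The same silhouette information can also be computed without factoextra, using silhouette() from the cluster package (which ships with R). A self-contained sketch on a fresh k-means fit of the scaled iris data:

```r
library(cluster)

set.seed(123)
df_sil <- scale(iris[, -5])
km <- kmeans(df_sil, centers = 3, nstart = 25)

sil <- silhouette(km$cluster, dist(df_sil))
head(sil[, 1:3])             # cluster, neighbor, sil_width for each observation
mean(sil[, "sil_width"])     # overall average silhouette width
```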
Let's also get some specific silhouette information:
silinfo <- kmeans$silinfo
names(silinfo)
## [1] "widths" "clus.avg.widths" "avg.width"
# Silhouette width of each observation
head(silinfo$widths[,1:3], 10)
## cluster neighbor sil_width
## 1 1 3 0.7341949
## 41 1 3 0.7333345
## 8 1 3 0.7308169
## 18 1 3 0.7287522
## 5 1 3 0.7284741
## 40 1 3 0.7247047
## 38 1 3 0.7244191
## 12 1 3 0.7217939
## 28 1 3 0.7215103
## 29 1 3 0.7145192
# Average silhouette width per cluster
silinfo$clus.avg.widths
## [1] 0.6363162 0.3473922 0.3933772
# Total average (mean of all individual widths)
silinfo$avg.width
## [1] 0.4599482
# The size of each cluster
kmeans$size
## [1] 50 47 53
Recall that observations with a large silhouette width (close to 1) are well clustered. Examine the plot for cluster 2 - do you see the negative values? These observations are in the wrong cluster. We can find the names of these samples and determine the clusters they are closer to:
sil <- kmeans$silinfo$widths[,1:3]
print("Objects with negative silhouette")
## [1] "Objects with negative silhouette"
neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop=FALSE]
## cluster neighbor sil_width
## 112 2 3 -0.01058434
## 128 2 3 -0.02489394
The Dunn Index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. The Dunn Index has a value between zero and infinity, and should be maximized.
The function cluster.stats() from the fpc package and the NbClust() in the eponymously named package can be used to compute the Dunn Index.
kmeans_stats <- cluster.stats(dist(df), kmeans$cluster)
# Dunn Index
kmeans_stats$dunn
## [1] 0.02649665
# Much more detail is available:
# kmeans_stats
This is computed using the Rand Index, which varies from -1 (no agreement) to 1 (perfect agreement). The Rand Index measures the similarity between two partitions, adjusted for chance.
Among the values returned by cluster.stats() are two indices that assess the similarity of two partitions: the corrected Rand Index and Meila's VI.
We know iris has 3 species. Does the K-Means clustering match the true structure of the data? Let's find out by computing a cross-tabulation between the K-Means clusters and the reference Species labels:
table(iris$Species, kmeans$cluster)
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 11 39
## virginica 0 36 14
This table shows that all 50 setosa observations fall in cluster 1, while versicolor and virginica are split between clusters 2 and 3.
Let's quantify the agreement between Species and K-Means using the corrected Rand Index and Meila's VI:
#Compute cluster stats
species <- as.numeric(iris$Species)
cluster_stats <- cluster.stats(d=dist(df), species, kmeans$cluster)
#Corrected Rand Index
cluster_stats$corrected.rand
## [1] 0.6201352
#VI
cluster_stats$vi
## [1] 0.7477749
Agreement between the species labels and the cluster solution is 0.62 using the corrected Rand Index and 0.748 using Meila's VI. The same analysis can be computed for PAM (Partitioning Around Medoids) and hierarchical clustering.
(The PAM algorithm was developed by Leonard Kaufman and Peter J. Rousseeuw. It is very similar to K-Means: both are partitional algorithms that break the dataset into groups while trying to minimize the error. PAM, however, works with medoids, actual entities of the dataset that represent the group they belong to, whereas K-Means works with centroids, artificially created entities that represent their cluster.)
#Agreement between species and pam clusters
pam <- eclust(df, "pam", k=3, graph=FALSE)
table(iris$Species, pam$cluster)
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 9 41
## virginica 0 36 14
cluster.stats(d=dist(df), species, pam$cluster)$vi
## [1] 0.7129034
#Agreement between species and HC clusters
hc <- eclust(df, "hclust", k=3, graph=FALSE)
table(iris$Species, hc$cluster)
##
## 1 2 3
## setosa 49 1 0
## versicolor 0 27 23
## virginica 0 2 48
cluster.stats(d=dist(df), species, hc$cluster)$vi
## [1] 0.6944976
Here is more detailed information on other clustering analysis options:
We will use the clValid package, which can compare multiple clustering algorithms in a single function call to find the best algorithm and the optimal number of clusters.
The clValid package leverages two cluster validation methods: internal measures and stability measures.
Stability Measures: a special version of internal measures that evaluates the consistency of a clustering result by comparing it with the clusters obtained after removing each column, one at a time. It reports several statistics (APN, AD, ADM and FOM) whose details I will not cover here:
df <- scale(iris[, -5])
#Compute clValid
clmethods <- c("hierarchical", "kmeans", "pam")
internal <- clValid(df, nClust=2:6, clMethods=clmethods, validation="internal")
summary(internal)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical Connectivity 0.9762 5.5964 7.5492 18.0508 24.7306
## Dunn 0.2674 0.1874 0.2060 0.0700 0.0762
## Silhouette 0.5818 0.4803 0.4067 0.3746 0.3248
## kmeans Connectivity 0.9762 23.8151 25.9044 40.3060 40.1385
## Dunn 0.2674 0.0265 0.0700 0.0808 0.0808
## Silhouette 0.5818 0.4599 0.4189 0.3455 0.3441
## pam Connectivity 0.9762 23.0726 31.8067 35.7964 44.5413
## Dunn 0.2674 0.0571 0.0566 0.0642 0.0361
## Silhouette 0.5818 0.4566 0.4091 0.3574 0.3400
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 0.9762 hierarchical 2
## Dunn 0.2674 hierarchical 2
## Silhouette 0.5818 hierarchical 2
The summary above suggests that hierarchical clustering with 2 clusters performs the best in each case. Regardless of the clustering algorithm, the optimal number of clusters appears to be 2 using these 3 measures. (Of course we know that there are really 3 clusters).
Stability measure can be used like this:
#clmethods <- c("hierarchical", "kmeans", "pam")
stab <- clValid(df, nClust=2:6, clMethods=clmethods, validation="stability")
optimalScores(stab)
## Score Method Clusters
## APN 0.003266667 hierarchical 2
## AD 1.225067382 hierarchical 6
## ADM 0.016087089 hierarchical 2
## FOM 0.510086909 hierarchical 6
The results above seem to confuse the number of clusters even more. APN and ADM suggest 2 clusters and the others 6. Hmm, not sure what to make of this package!
K-Means Clustering is the most popular algorithm for clustering analysis. However, it has limitations, including the need to choose the number of clusters k in advance and its sensitivity to the initial centroids and to outliers.
Below we explore some alternative strategies that intend to overcome the limitations.
A hybrid method called Hierarchical K-Means Clustering improves upon base K-Means by using hierarchical clustering to choose the initial centroids.
df <- scale(iris[,-5])
newKmeans <- hkmeans(df,3)
fviz_dend(newKmeans, cex=0.6, palette = "jco", rect = TRUE, rect_border = "jco", rect_fill = TRUE)
fviz_cluster(newKmeans, palette="jco", repel=TRUE, ggtheme=theme_classic())
Fuzzy Clustering is a form of soft clustering in which each element has a probability of belonging to each cluster: each element has a set of membership coefficients corresponding to its degree of belonging to each cluster. This differs from K-Means clustering, where each object is assigned to exactly one cluster.
Fuzzy C-Means (FCM) is the most widely used fuzzy clustering algorithm. fanny(), part of the cluster package, is used to compute fuzzy clustering.
fuzzy <- fanny(df, 3)# fuzzy clustering with k=3
fviz_cluster(fuzzy, ellipse.type = "norm", repel = TRUE, palette="jco", ggtheme=theme_classic(), legend="right")
fviz_silhouette(fuzzy, palette="jco", ggtheme=theme_minimal())
## cluster size ave.sil.width
## 1 1 50 0.63
## 2 2 51 0.33
## 3 3 49 0.41
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density-based clustering algorithm. Density clustering is useful when the clusters may be irregularly shaped.
We will use a different data set to illustrate this.
df2 <- multishapes[1:2]
plot(df2)
km <- kmeans(df2, 5, nstart=25)
fviz_cluster(km, df2, geom = "point", ellipse = FALSE, show.clust.cent = FALSE, palette="jco", ggtheme=theme_classic())
K-Means above fails to correctly identify the 5 clusters we know to exist.
mydbscan <- dbscan::dbscan(df2, eps = 0.15, minPts = 5)
fviz_cluster(mydbscan, df2, stand=FALSE, ellipse = FALSE, show.clust.cent = FALSE, geom = "point", palette="jco", ggtheme=theme_classic())
The results of dbscan are available to review:
print(mydbscan)
## DBSCAN clustering for 1100 objects.
## Parameters: eps = 0.15, minPts = 5
## The clustering contains 5 cluster(s) and 31 noise points.
##
## 0 1 2 3 4 5
## 31 410 405 104 99 51
##
## Available fields: cluster, eps, minPts
The column named 0 above contains the noise points (outliers).
dbscan requires the user to specify the neighborhood radius eps and the parameter minPts (the minimum number of points needed to form a dense region). minPts is typically adjusted manually - but how do you choose eps?
There is a function to help identify the eps value:
dbscan::kNNdistplot(df2, k=5)
abline(h=0.15, lty=2)
I added the horizontal line for emphasis. You can see that the elbow appears to be close to y=0.15.
In the fpc package, a function called predict.dbscan is available to make predictions on new data using the format: predict.dbscan(object, data, newdata)
Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each other than objects in different groups. Most clustering algorithms depend on some assumptions in order to define the subgroups present in a data set. As a consequence, the resulting clustering scheme requires some sort of evaluation of its validity.
The evaluation procedure has to tackle difficult problems such as the quality of the clusters, the degree to which a clustering scheme fits a specific data set, and the optimal number of clusters in a partitioning. In the literature, a wide variety of indices have been proposed for finding the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of the indices proposed in the literature, programs are unavailable to test and compare them.
The R package NbClust has been developed for that purpose. It provides 30 indices for determining the number of clusters in a data set, and it also offers the user the best clustering scheme from the different results. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, helping to determine the most appropriate number of clusters for the data set of interest.
data <- read.table("./data/SimulatedData.txt", header = TRUE, quote = "\"")
plot(data)
res <- NbClust(data, distance = "euclidean", min.nc = 2, max.nc = 8, method = "ward.D2",
index = "duda")
res$All.index
## 2 3 4 5 6 7 8
## 0.0388 0.0738 0.5971 0.6691 0.6602 0.6210 0.4200
res$Best.nc
## Number_clusters Value_Index
## 4.0000 0.5971
res$All.CriticalValues
## 2 3 4 5 6 7 8
## 0.4349 0.4349 0.3327 0.3327 0.3327 0.3327 0.2234
res$Best.partition
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [176] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
res <- NbClust(data, distance = "euclidean", min.nc = 2, max.nc = 8, method = "complete",
index = "alllong")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 3 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 19 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
res$All.index
## KL CH Hartigan CCC Scott Marriot TrCovW
## 2 3.4594 885.3975 398.1129 17.3825 530.5325 39798.188 228.9249
## 3 1.2566 1524.1400 1126.0955 17.4487 761.9166 28157.985 237.8424
## 4 26.9498 7163.1259 17.0799 30.6642 1247.1374 4424.207 223.0371
## 5 0.9651 5814.9538 11.1679 26.0102 1287.0337 5662.677 151.7204
## 6 2.7236 4895.3856 22.8628 22.4624 1306.3867 7402.183 145.2914
## 7 2.1804 4540.5372 23.4182 20.3053 1351.0258 8059.734 123.0103
## 8 1.2075 4344.8386 20.6834 18.7214 1403.6219 8092.706 106.9838
## TraceW Friedman Rubin Cindex DB Silhouette Duda Pseudot2
## 2 676.8594 295.4549 13.8329 0.3601 0.4635 0.7209 0.0388 2430.0879
## 3 224.8201 328.3267 41.6464 0.3043 0.2878 0.7794 0.0738 1230.0792
## 4 33.4742 564.6962 279.7063 0.3058 0.2415 0.8441 0.7175 18.9003
## 5 30.7910 634.2843 304.0806 0.3428 0.7714 0.6948 0.7679 14.5044
## 6 29.1231 661.0045 321.4956 0.3515 0.9099 0.5547 0.6189 29.5544
## 7 26.0528 739.4890 359.3836 0.3425 1.0936 0.4442 0.6772 22.8848
## 8 23.2337 858.2385 402.9905 0.3238 1.2702 0.2986 0.4896 28.1467
## Beale Ratkowsky Ball Ptbiserial Gap Frey McClain Gamma
## 2 24.5463 0.6392 338.4297 0.8002 0.0643 0.9342 0.2645 0.9375
## 3 12.4250 0.5595 74.9400 0.7801 0.3547 0.9331 0.3067 0.9980
## 4 0.3857 0.4977 8.3685 0.7016 1.7257 12.9402 0.2496 1.0000
## 5 0.2960 0.4453 6.1582 0.6461 1.3799 10.3583 0.2972 0.9722
## 6 0.6032 0.4066 4.8538 0.6219 0.9882 8.5647 0.3211 0.9620
## 7 0.4670 0.3766 3.7218 0.5672 0.8816 5.1583 0.3845 0.9423
## 8 1.0052 0.3524 2.9042 0.5140 0.6793 4.3971 0.4548 0.9320
## Gplus Tau Dunn Hubert SDindex Dindex SDbw
## 2 155.3597 4664.155 0.5009 3e-04 1.2374 1.7764 0.1828
## 3 4.7469 4638.747 0.6723 3e-04 0.7843 0.8928 0.0438
## 4 0.0011 3693.465 0.8184 4e-04 0.9362 0.3622 0.0091
## 5 46.8435 3272.053 0.0934 4e-04 5.9589 0.3455 0.0915
## 6 61.0775 3089.934 0.0975 4e-04 5.6107 0.3344 0.0895
## 7 81.9910 2680.056 0.0628 4e-04 6.0590 0.3152 0.1373
## 8 83.6208 2293.822 0.0640 4e-04 5.3941 0.2994 0.1280
res$Best.nc
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 4.0000 4.000 4.000 4.0000 4.0000 4.00
## Value_Index 26.9498 7163.126 1109.016 30.6642 485.2208 24972.25
## TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters 5.0000 3.0000 4.0000 4.0000 3.0000 4.0000
## Value_Index 71.3167 260.6934 236.3695 -213.6856 0.3043 0.2415
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## Number_clusters 4.0000 4.0000 4.0000 4.0000 2.0000 3.0000
## Value_Index 0.8441 0.7175 18.9003 0.3857 0.6392 263.4897
## PtBiserial Gap Frey McClain Gamma Gplus Tau
## Number_clusters 2.0000 4.0000 1 4.0000 4 4.0000 2.000
## Value_Index 0.8002 1.7257 NA 0.2496 1 0.0011 4664.155
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 4.0000 0 3.0000 0 4.0000
## Value_Index 0.8184 0 0.7843 0 0.0091
res$All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale CritValue_Gap
## 2 0.4349 127.3323 0.0000 -0.2859
## 3 0.4349 127.3323 0.0000 -1.3651
## 4 0.3327 96.2763 0.6810 0.3531
## 5 0.3327 96.2763 0.7445 0.4008
## 6 0.3327 96.2763 0.5491 0.1166
## 7 0.3327 96.2763 0.6283 0.2139
## 8 0.2234 93.8395 0.3727 0.1015
res$Best.partition
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [176] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
set.seed(1)
y <- rbind(matrix(rnorm(150,sd = 0.3),ncol = 3),
matrix(rnorm(150,mean = 3,sd = 0.2),ncol = 3),
matrix(rnorm(150,mean = 5,sd = 0.3),ncol = 3))
diss_matrix <- dist(y, method = "euclidean", diag = FALSE)
res <- NbClust(y, diss = diss_matrix, distance = NULL, min.nc = 2, max.nc = 6,
method = "average", index = "ch")
res$All.index
## 2 3 4 5 6
## 721.3436 4490.9857 3080.0133 2345.5167 2011.6524
res$Best.nc
## Number_clusters Value_Index
## 3.000 4490.986
res$Best.partition
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3
set.seed(1)
x <- rbind(matrix(rnorm(20, sd = 0.1), ncol = 2),
matrix(rnorm(20, mean = 1, sd = 0.2), ncol = 2),
matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2),
matrix(rnorm(20, mean = 7, sd = 0.2), ncol = 2))
diss_matrix <- dist(x, method = "euclidean", diag = FALSE)
res <- NbClust(x, diss = diss_matrix, distance = NULL, min.nc = 2, max.nc = 6,
method = "complete", index = "alllong")
## [1] "Frey index : No clustering structure in this data set"
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 13 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
res$All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 7.9990 447.811 136.9065 11.9651 137.8320 146.7611 2.6561 51.4968
## 3 2.5510 1070.124 223.8314 13.0324 201.4264 67.3481 0.5114 11.1881
## 4 59.1137 4965.892 12.5187 18.4692 304.1030 9.1919 0.5535 1.5871
## 5 1.2432 4883.169 9.8189 16.9714 326.5920 8.1857 0.4697 1.1776
## 6 2.0958 4861.459 6.9790 15.8757 346.8095 7.1106 0.3324 0.9196
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale
## 2 1023.010 29.3249 0.3700 0.2764 0.8380 0.0210 838.6080 44.1373
## 3 1120.483 134.9769 0.3519 0.1335 0.8870 0.0699 239.3970 12.5998
## 4 2076.653 951.5190 0.2726 0.1986 0.8678 0.4471 9.8935 1.0993
## 5 2707.342 1282.4029 0.3165 0.4692 0.7469 0.5427 6.7421 0.7491
## 6 3515.091 1642.1697 0.3268 0.5444 0.6627 0.4792 6.5198 0.9314
## Ratkowsky Ball Ptbiserial Gap Frey McClain Gamma Gplus
## 2 0.6789 25.7484 0.9220 0.5352 1.7200 0.1667 1.0000 0.0000
## 3 0.5724 3.7294 0.8290 1.6018 2.1893 0.1730 1.0000 0.0000
## 4 0.4994 0.3968 0.6701 3.1798 6.7883 0.1399 1.0000 0.0000
## 5 0.4468 0.2355 0.6161 2.8963 5.2642 0.1524 0.9884 0.7244
## 6 0.4080 0.1533 0.5796 2.4222 7.8179 0.1578 0.9881 0.6833
## Tau Dunn Hubert SDindex Dindex SDbw
## 2 194.8718 1.5955 0.0016 0.6943 1.0666 0.0784
## 3 179.4872 1.3155 0.0018 0.8993 0.4392 0.0125
## 4 138.4615 1.2067 0.0018 1.6575 0.1649 0.0026
## 5 123.3513 0.2498 0.0018 5.4795 0.1467 0.0020
## 6 113.5051 0.2934 0.0019 5.5902 0.1306 0.0016
res$Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 4.0000 4.000 4.0000 4.0000 4.0000 4.0000 3.0000
## Value_Index 59.1137 4965.892 211.3126 18.4692 102.6765 57.1499 2.1448
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 4.0000 4.0000 4.0000 3.0000 3.000 4.0000
## Value_Index 30.7076 956.1706 -485.6583 0.2726 0.1335 0.887 0.4471
## PseudoT2 Beale Ratkowsky Ball PtBiserial Gap Frey
## Number_clusters NA 4.0000 2.0000 3.000 2.000 4.0000 NA
## Value_Index NA 1.0993 0.6789 22.019 0.922 3.1798 NA
## McClain Gamma Gplus Tau Dunn Hubert SDindex Dindex
## Number_clusters 4.0000 2 2 2.0000 2.0000 0 2.0000 0
## Value_Index 0.1399 1 0 194.8718 1.5955 0 0.6943 0
## SDbw
## Number_clusters 6.0000
## Value_Index 0.0016
res$All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale CritValue_Gap
## 2 0.1299 120.5893 0.0000 -1.0459
## 3 0.1299 120.5893 0.0001 -1.5525
## 4 -0.0987 -89.0644 0.3570 0.3178
## 5 -0.0987 -89.0644 0.4887 0.5237
## 6 -0.1908 -37.4469 0.4207 0.2581
res$Best.partition
## [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4
## [36] 4 4 4 4 4
# 4.2 Real data set
pairs(iris[, 1:4], bg = c("yellow", "green", "black")[iris$Species], pch = 21)
data <- iris[, -5]
diss_matrix <- dist(data, method = "euclidean", diag = FALSE)
res <- NbClust(data, diss = diss_matrix, distance = NULL, min.nc = 2, max.nc = 10,
method = "complete", index = "alllong")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 2 proposed 2 as the best number of clusters
## * 15 proposed 3 as the best number of clusters
## * 6 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
res$All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 1.9652 280.8392 240.7478 30.4441 933.9084 977604.0 6868.5401 235.1531
## 3 5.3598 485.9050 68.8363 35.8668 1210.7629 347351.8 304.1791 89.5250
## 4 54.0377 495.1816 16.4167 35.6036 1346.7582 249402.3 135.7432 60.9730
## 5 0.0263 414.3925 51.1371 33.0698 1387.9419 296129.2 121.5044 54.8099
## 6 7.1653 455.4931 16.8076 33.9870 1506.5585 193380.9 96.9908 40.5198
## 7 0.5308 423.7198 20.2960 32.9063 1560.0089 184311.4 93.2005 36.2847
## 8 2.4071 414.7146 4.4653 32.4873 1628.7974 152185.5 60.9393 31.7749
## 9 6.5604 372.2046 8.2537 31.0319 1646.9164 170694.1 55.3030 30.8062
## 10 0.2708 348.6421 9.1553 30.1191 1680.9385 167969.1 55.2821 29.1026
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale
## 2 715.2826 40.5663 0.3723 0.7027 0.5160 0.1460 444.4821 13.9360
## 3 804.1705 106.5545 0.3163 0.7025 0.5136 0.5582 55.4060 1.8840
## 4 955.5312 156.4512 0.3465 0.7289 0.4998 0.5932 32.9134 1.6216
## 5 991.9852 174.0431 0.3758 0.9838 0.3462 0.5452 48.3914 1.9801
## 6 1070.1736 235.4228 0.4032 1.0524 0.3382 0.5656 19.9691 1.7855
## 7 1171.9307 262.9011 0.3982 1.0030 0.3298 0.6480 19.5552 1.2760
## 8 1251.1704 300.2146 0.4118 1.0738 0.3240 2.1863 -11.9371 -1.2530
## 9 1290.8832 309.6552 0.4098 0.9954 0.3258 0.6340 5.7720 1.2668
## 10 1353.2708 327.7814 0.4045 1.0396 0.3095 0.6575 9.8984 1.1948
## Ratkowsky Ball Ptbiserial Gap Frey McClain Gamma Gplus
## 2 0.4729 117.5765 0.6369 -0.2356 0.2675 0.4228 0.7472 353.1090
## 3 0.4922 29.8417 0.7203 0.1343 0.8589 0.4964 0.8928 139.9284
## 4 0.4387 15.2432 0.6948 -0.1465 134.6913 0.5734 0.9261 87.9342
## 5 0.4026 10.9620 0.6073 -0.3669 1.1448 0.7936 0.8589 149.0951
## 6 0.3738 6.7533 0.5295 -0.3256 0.6883 1.0742 0.8919 88.5252
## 7 0.3482 5.1835 0.5212 -0.5714 1.2624 1.1037 0.9020 77.1718
## 8 0.3275 3.9719 0.4753 -0.6911 0.5934 1.3191 0.9115 58.7781
## 9 0.3092 3.4229 0.4729 -0.9371 0.7370 1.3284 0.9145 56.0378
## 10 0.2941 2.9103 0.4688 -1.1656 0.7430 1.3469 0.9179 52.7862
## Tau Dunn Hubert SDindex Dindex SDbw
## 2 2475.495 0.0824 0.0015 1.8326 1.1446 0.8976
## 3 2649.840 0.1033 0.0020 1.6226 0.6722 0.2350
## 4 2495.851 0.1365 0.0022 1.9103 0.5832 0.1503
## 5 2206.153 0.1000 0.0022 3.4597 0.5513 0.5055
## 6 1728.103 0.1311 0.0023 3.5342 0.4778 0.3126
## 7 1664.993 0.1346 0.0023 3.6106 0.4530 0.2284
## 8 1384.061 0.1529 0.0023 3.9101 0.4239 0.0357
## 9 1367.483 0.1539 0.0023 4.0152 0.4171 0.0312
## 10 1340.581 0.1543 0.0024 4.0261 0.4060 0.0303
res$Best.nc
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 4.0000 4.0000 3.0000 3.0000 3.0000 3.0
## Value_Index 54.0377 495.1816 171.9115 35.8668 276.8545 532302.7
## TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters 3.000 3.000 4.0000 6.0000 3.0000 3.0000
## Value_Index 6564.361 117.076 151.3607 -33.9014 0.3163 0.7025
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## Number_clusters 2.000 4.0000 4.0000 3.000 3.0000 3.0000
## Value_Index 0.516 0.5932 32.9134 1.884 0.4922 87.7349
## PtBiserial Gap Frey McClain Gamma Gplus Tau
## Number_clusters 3.0000 3.0000 1 2.0000 4.0000 10.0000 3.00
## Value_Index 0.7203 0.1343 NA 0.4228 0.9261 52.7862 2649.84
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 10.0000 0 3.0000 0 10.0000
## Value_Index 0.1543 0 1.6226 0 0.0303
res$All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale CritValue_Gap
## 2 0.6121 48.1694 0.0000 -0.3642
## 3 0.6027 46.1391 0.1134 0.2891
## 4 0.5551 38.4707 0.1704 0.2300
## 5 0.5800 42.0003 0.0983 -0.0305
## 6 0.4590 30.6444 0.1373 0.2590
## 7 0.5131 34.1652 0.2822 0.1346
## 8 0.4284 29.3527 1.0000 0.2635
## 9 0.2576 28.8239 0.2990 0.2483
## 10 0.3999 28.5079 0.3200 0.2200
res$Best.partition
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3
## [71] 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2 2 2 2
## [106] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2
k-Means performs better than Hierarchical Clustering on large datasets.
Pretty simple concept: people who buy one item also tend to buy certain other items.
The Apriori algorithm consists of 3 parts: support, confidence, and lift. ("A priori" refers to using previous knowledge.)
Process: compute the support of the itemsets, keep those above the minimum support, build the rules that meet the minimum confidence, and sort the rules by decreasing lift.
Apriori is a basic example; modern recommendation engines are far more sophisticated.
# Data Preprocessing
#We execute the read.csv only to see the data. It is not needed for read.transactions
dataset = read.csv('./data/Market_Basket_Optimisation.csv', header = FALSE)
head(dataset)
## V1 V2 V3 V4 V5
## 1 shrimp almonds avocado vegetables mix green grapes
## 2 burgers meatballs eggs
## 3 chutney
## 4 turkey avocado
## 5 mineral water milk energy bar whole wheat rice green tea
## 6 low fat yogurt
## V6 V7 V8 V9 V10
## 1 whole weat flour yams cottage cheese energy drink tomato juice
## 2
## 3
## 4
## 5
## 6
## V11 V12 V13 V14 V15 V16
## 1 low fat yogurt green tea honey salad mineral water salmon
## 2
## 3
## 4
## 5
## 6
## V17 V18 V19 V20
## 1 antioxydant juice frozen smoothie spinach olive oil
## 2
## 3
## 4
## 5
## 6
#Now make a sparse matrix - each product will have its own column
dataset = read.transactions('./data/Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## 1
## 5
#The lines above return some information. Interpret as: 5 transactions each contained 1 duplicate
summary(dataset)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17
## 16 18 19 20
## 4 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
itemFrequencyPlot(dataset, topN = 10)
# Training Apriori on the dataset
# support and confidence depend on the specific circumstances - there are no fixed rules to follow
# We will evaluate only the products purchased at least 3 times/day (an iterative process). The dataset contains a week's worth of purchase data, therefore support = (3 x 7)/7500 = 0.0028, rounded to 0.003
# Confidence depends on business rules. Start with the default, then iterate (0.8 is very high)
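The support value used in the call below comes from simple arithmetic (a sketch of the reasoning; the round figure 0.003 is what the code actually uses):

```r
# An item bought 3 times a day, over the 7 days covered by the dataset,
# appears in 3 * 7 = 21 of the ~7500 transactions
support_3_per_day <- (3 * 7) / 7500
support_3_per_day  # 0.0028, rounded up to 0.003
```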
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.02s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# The code above produces 0 rules. The confidence is too high - a rule would have to be right 80% of the time!
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [281 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
#Now 281 rules!
# Visualizing the results
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs support confidence lift
## [1] {mineral water,
## whole wheat pasta} => {olive oil} 0.003866151 0.4027778 6.115863
## [2] {spaghetti,
## tomato sauce} => {ground beef} 0.003066258 0.4893617 4.980600
## [3] {french fries,
## herb & pepper} => {ground beef} 0.003199573 0.4615385 4.697422
## [4] {cereals,
## spaghetti} => {ground beef} 0.003066258 0.4600000 4.681764
## [5] {frozen vegetables,
## mineral water,
## soup} => {milk} 0.003066258 0.6052632 4.670863
## [6] {chocolate,
## herb & pepper} => {ground beef} 0.003999467 0.4411765 4.490183
## [7] {chocolate,
## mineral water,
## shrimp} => {frozen vegetables} 0.003199573 0.4210526 4.417225
## [8] {frozen vegetables,
## mineral water,
## olive oil} => {milk} 0.003332889 0.5102041 3.937285
## [9] {cereals,
## ground beef} => {spaghetti} 0.003066258 0.6764706 3.885303
## [10] {frozen vegetables,
## soup} => {milk} 0.003999467 0.5000000 3.858539
Some problems are suggested in the output. Example: rule [6] says chocolate and herb & pepper suggest ground beef. Recall chocolate is one of the most purchased items, so it shows up in many baskets - but does this rule make sense? It might indicate that the confidence is still too high. (We do not want to change the business rules that drive the support parameter.) The same potential problem applies to mineral water.
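Lift helps detect exactly this kind of problem: it divides the rule's confidence by the baseline popularity of the right-hand side, so rules pointing at very popular items are discounted. A small sketch using rule [1] from the output above:

```r
confidence <- 0.4027778  # P(olive oil | mineral water, whole wheat pasta), from inspect()
lift       <- 6.115863   # from the same inspect() output
# lift = confidence / support(rhs), so the baseline support of olive oil is:
support_olive_oil <- confidence / lift
support_olive_oil  # ~0.066: roughly 6.6% of all baskets contain olive oil
```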
rules = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.02s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1348 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs support
## [1] {mineral water,whole wheat pasta} => {olive oil} 0.003866151
## [2] {frozen vegetables,milk,mineral water} => {soup} 0.003066258
## [3] {fromage blanc} => {honey} 0.003332889
## [4] {spaghetti,tomato sauce} => {ground beef} 0.003066258
## [5] {light cream} => {chicken} 0.004532729
## [6] {pasta} => {escalope} 0.005865885
## [7] {french fries,herb & pepper} => {ground beef} 0.003199573
## [8] {cereals,spaghetti} => {ground beef} 0.003066258
## [9] {frozen vegetables,mineral water,soup} => {milk} 0.003066258
## [10] {french fries,ground beef} => {herb & pepper} 0.003199573
## confidence lift
## [1] 0.4027778 6.115863
## [2] 0.2771084 5.484407
## [3] 0.2450980 5.164271
## [4] 0.4893617 4.980600
## [5] 0.2905983 4.843951
## [6] 0.3728814 4.700812
## [7] 0.4615385 4.697422
## [8] 0.4600000 4.681764
## [9] 0.6052632 4.670863
## [10] 0.2307692 4.665768
Many more rules, as expected - 1348. Most of the rules appear to make more sense. But does cereal and spaghetti really suggest ground beef with 46% confidence?
We could also try changing the support. What happens if we restrict to products purchased 4 times/day? (support = (4 x 7)/7500 = 0.00373, rounded to 0.004)
rules = apriori(data = dataset, parameter = list(support = 0.004, confidence = 0.2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.004 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 30
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [114 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [811 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(sort(rules, by = 'lift')[1:10])
## lhs rhs support confidence lift
## [1] {light cream} => {chicken} 0.004532729 0.2905983 4.843951
## [2] {pasta} => {escalope} 0.005865885 0.3728814 4.700812
## [3] {pasta} => {shrimp} 0.005065991 0.3220339 4.506672
## [4] {eggs,
## ground beef} => {herb & pepper} 0.004132782 0.2066667 4.178455
## [5] {whole wheat pasta} => {olive oil} 0.007998933 0.2714932 4.122410
## [6] {herb & pepper,
## spaghetti} => {ground beef} 0.006399147 0.3934426 4.004360
## [7] {herb & pepper,
## mineral water} => {ground beef} 0.006665778 0.3906250 3.975683
## [8] {tomato sauce} => {ground beef} 0.005332622 0.3773585 3.840659
## [9] {mushroom cream sauce} => {escalope} 0.005732569 0.3006993 3.790833
## [10] {frozen vegetables,
## mineral water,
## spaghetti} => {ground beef} 0.004399413 0.3666667 3.731841
811 rules now because of the increased support. The rules look valid. The business owner could now rearrange product placements to improve sales.
Eclat is a simplified version of Association Rule Learning. Eclat talks about sets, not rules.
Eclat only has the support parameter:
Eclat provides simple results consisting of sets of items frequently bought together.
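With only support to work with, Eclat essentially just counts co-occurrence. A toy sketch of the computation (hypothetical mini-basket data, not the real dataset):

```r
# Five hypothetical baskets
baskets <- list(c("milk", "bread"), c("milk", "eggs"), c("milk", "bread", "eggs"),
                c("bread"), c("milk", "bread"))
# support({milk, bread}) = fraction of baskets containing both items
contains_both <- sapply(baskets, function(b) all(c("milk", "bread") %in% b))
support_milk_bread <- mean(contains_both)
support_milk_bread  # 3 of 5 baskets -> 0.6
```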
# Data Preprocessing
dataset = read.csv('./data/Market_Basket_Optimisation.csv')
dataset = read.transactions('./data/Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
## distribution of transactions with duplicates:
## 1
## 5
summary(dataset)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17
## 16 18 19 20
## 4 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
itemFrequencyPlot(dataset, topN = 10)
# Training Eclat on the dataset. Note: no confidence parameter. We also use minlen,
# because a set of 1 item is meaningless.
rules = eclat(data = dataset, parameter = list(support = 0.003, minlen = 2))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.003 2 10 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 22
##
## create itemset ...
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [115 item(s)] done [0.00s].
## creating sparse bit matrix ... [115 row(s), 7501 column(s)] done [0.00s].
## writing ... [1328 set(s)] done [0.02s].
## Creating S4 object ... done [0.00s].
# Visualising the results. Note: No lift parameter.
inspect(sort(rules, by = 'support')[1:10])
## items support
## [1] {mineral water,spaghetti} 0.05972537
## [2] {chocolate,mineral water} 0.05265965
## [3] {eggs,mineral water} 0.05092654
## [4] {milk,mineral water} 0.04799360
## [5] {ground beef,mineral water} 0.04092788
## [6] {ground beef,spaghetti} 0.03919477
## [7] {chocolate,spaghetti} 0.03919477
## [8] {eggs,spaghetti} 0.03652846
## [9] {eggs,french fries} 0.03639515
## [10] {frozen vegetables,mineral water} 0.03572857
1328 sets are returned (only the top 10 by support are shown). The results are biased toward the most popular items. Not as helpful as Apriori.
Learn by example (the multi-armed bandit problem, a generalization of the one-armed bandit). This approach reduces the number of A/B tests needed, saving money while still identifying the best ad.
dataset = read.csv('./data/Ads_CTR_Optimisation.csv')
head(dataset)
## Ad.1 Ad.2 Ad.3 Ad.4 Ad.5 Ad.6 Ad.7 Ad.8 Ad.9 Ad.10
## 1 1 0 0 0 1 0 0 0 1 0
## 2 0 0 0 0 0 0 0 0 1 0
## 3 0 0 0 0 0 0 0 0 0 0
## 4 0 1 0 0 0 0 0 1 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 1 1 0 0 0 0 0 0 0 0
The data is a sparse matrix of different ads. Which will give the best conversion rate? What strategy determines which ad yields the best CTR? Reinforcement Learning uses the previous records to select which ad to show. (Exploration vs. exploitation.)
Let's evaluate the first few records so we understand what is happening.
The first user clicks on ads 1, 5, and 9. The second user clicks only on the 9th ad. The third does not click through on any ad.
These previous observations are used to determine the next ad displayed to future users.
Before proceeding with reinforcement learning, lets see what happens if we randomly select ads to display:
# Implementing Random Selection
N = 10000
d = 10
ads_selected = integer(0)
total_reward = 0
for (n in 1:N) {
ad = sample(1:10, 1)
ads_selected = append(ads_selected, ad)
reward = dataset[n, ad]
total_reward = total_reward + reward
}
total_reward
## [1] 1211
# Visualizing the results
hist(ads_selected, col = 'blue',
main = 'Histogram of ads selections', sub = 'Ave total reward ~ 1200',
xlab = 'Ads', ylab = 'Number of times each ad was selected')
Randomly selecting the ads, the total reward averages around 1200. We would expect a nearly uniform histogram since the selection is randomized.
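The ~1200 figure is just the overall click-through rate showing up: with 10,000 rounds, a total reward of 1211 corresponds to roughly a 12% chance of a click per round under random selection (a back-of-envelope check):

```r
total_reward_random <- 1211   # from the run above
N <- 10000                    # number of rounds
ctr_random <- total_reward_random / N
ctr_random  # ~0.12: the average CTR across all 10 ads
```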
Now let's see if we can do better with reinforcement learning. Unfortunately, there is no R package with this functionality - at least not yet! Therefore, we will build it from scratch.
# Implementing UCB
N = 10000 # number of ad selection rounds
d = 10 # number of different ads
ads_selected = integer(0)
numbers_of_selections = integer(d) # Step 1
sums_of_rewards = integer(d) # Step 1
total_reward = 0
#Start of Step 2
for (n in 1:N) {
ad = 0
max_upper_bound = 0
for (i in 1:d) {
if (numbers_of_selections[i] > 0) {
#Below is the average reward calculation
average_reward = sums_of_rewards[i] / numbers_of_selections[i]
#Calculate the confidence level
delta_i = sqrt(3/2 * log(n) / numbers_of_selections[i])
upper_bound = average_reward + delta_i
} else {
upper_bound = 1e400 # evaluates to Inf, forcing each ad to be selected once before the bound formula applies
}
if (upper_bound > max_upper_bound) {
max_upper_bound = upper_bound
ad = i
}
}
#Add elements to a vector. append(x, values, after = length(x))
#A vector containing the values in x with the elements of values appended after the specified element of x.
ads_selected = append(ads_selected, ad)
numbers_of_selections[ad] = numbers_of_selections[ad] + 1
reward = dataset[n, ad]
sums_of_rewards[ad] = sums_of_rewards[ad] + reward
total_reward = total_reward + reward
}
total_reward
## [1] 2178
Note that the total reward is nearly double the random total reward!
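To see how the upper bound trades off exploitation and exploration, here is the delta_i term from the loop above evaluated by hand for two ads at round n = 100 (hypothetical selection counts, purely to illustrate):

```r
n <- 100
# An ad selected often gets a narrow bound; a rarely-selected ad a wide one
delta_often  <- sqrt(3/2 * log(n) / 50)  # selected 50 times so far
delta_rarely <- sqrt(3/2 * log(n) / 5)   # selected only 5 times so far
c(delta_often, delta_rarely)  # the rarely-tried ad gets a much bigger exploration bonus
```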
Let's look at the ads selected:
ads_selected[1:40]
## [1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 1 2
## [24] 3 4 5 6 7 8 9 9 10 1 9 2 3 4 5 6 7
Notice that during the first rounds each of the 10 ads is simply tried in order. After that, the algorithm kicks in and starts learning.
To know which ad is best (highest CTR / conversion rate), just look at the end of the ads_selected vector:
ads_selected[(N-40):N]
## [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## [36] 5 5 5 5 5 5
Ad 5 appears to be the winner!
# Visualizing the results
hist(ads_selected,
col = 'blue',
main = 'Histogram of ads selections',
xlab = 'Ads',
ylab = 'Number of times each ad was selected')
Thompson Sampling is a probabilistic algorithm.
One notable difference is that Thompson Sampling works with probability distributions, whereas the Upper Confidence Bound uses confidence intervals that converge to the expected value.
Here is the process for the Thompson Sampling algorithm that we will implement in R.
A major advantage of Thompson Sampling is that updates can be batched - useful in web testing, where processing each individual click immediately would be too computationally expensive.
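The core of the update is the Beta posterior: each ad's click probability is modelled as Beta(successes + 1, failures + 1), and the ad with the largest random draw is shown. A minimal sketch with two hypothetical ads (the counts are made up for illustration):

```r
set.seed(42)
# Ad A: 30 clicks out of 100 impressions; Ad B: 2 clicks out of 10 impressions
draw_A <- rbeta(1, shape1 = 30 + 1, shape2 = 70 + 1)
draw_B <- rbeta(1, shape1 = 2 + 1,  shape2 = 8 + 1)
# Ad B's posterior is much wider, so it still wins some rounds (exploration)
which.max(c(A = draw_A, B = draw_B))
```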
# Importing the dataset
dataset = read.csv('./data/Ads_CTR_Optimisation.csv')
# Implementing Thompson Sampling
N = 10000
d = 10
ads_selected = integer(0)
numbers_of_rewards_1 = integer(d)#initializes a vector of 10 0's
numbers_of_rewards_0 = integer(d)#initializes a vector of 10 0's
total_reward = 0
for (n in 1:N) {
ad = 0
max_random = 0
#rbeta: Density, distribution function, quantile function and random generation for the Beta distribution
#with parameters shape1 and shape2
for (i in 1:d) {
random_beta = rbeta(n = 1,
shape1 = numbers_of_rewards_1[i] + 1,
shape2 = numbers_of_rewards_0[i] + 1)
if (random_beta > max_random) {
max_random = random_beta
ad = i
}
}
ads_selected = append(ads_selected, ad)
reward = dataset[n, ad]
if (reward == 1) {numbers_of_rewards_1[ad] = numbers_of_rewards_1[ad] + 1}
else {numbers_of_rewards_0[ad] = numbers_of_rewards_0[ad] + 1}
total_reward = total_reward + reward
}
total_reward
## [1] 2578
Recall that random selection gave an average reward of ~1200 and the Upper Confidence Bound produced a total reward of 2178.
The Thompson Sampling algorithm gives ~2578. (There is some randomization involved, so the total reward varies between runs.)
# Visualizing the results
hist(ads_selected,
col = 'blue',
main = 'Histogram of ads selections',
xlab = 'Ads',
ylab = 'Number of times each ad was selected')
Again, ad 5 is selected most often. It appears even more dominant here than in the plot for UCB.
Natural Language Processing (or NLP) is applying Machine Learning models to text and language. Teaching machines to understand what is said in spoken and written word is the focus of Natural Language Processing. Whenever you dictate something into your iPhone / Android device that is then converted to text, that’s an NLP algorithm in action.
You can also use NLP on a text review to predict whether the review is a good one or a bad one. You can use NLP on an article to predict categories for the articles you are trying to segment, or on a book to predict its genre. It can go further: you can use NLP to build a machine translator or a speech recognition system, and in that last case you use classification algorithms to classify language. Speaking of classification algorithms, most NLP algorithms are classification models, including Logistic Regression, Naive Bayes, CART (a model based on decision trees), Maximum Entropy (again related to decision trees), and Hidden Markov Models (models based on Markov processes).
A very well-known model in NLP is the Bag of Words model. It is a model used to preprocess the texts to classify before fitting the classification algorithms on the observations containing the texts.
In this part, you will understand and learn how to:
# Importing the dataset
dataset_original = read.delim('./data/Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)
# quote is very useful in NLP - it ignores all quotes. Note stringsAsFactors = FALSE for NLP.
#Will end up with a sparse matrix where each word is its own column. Of course we will try to minimize the number
#of columns to make the model more accurate
#
# Cleaning the texts
# install.packages('tm')
# install.packages('SnowballC')
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(dataset_original$Review))#from tm package
class(corpus)
## [1] "VCorpus" "Corpus"
#Look at some content on corpus using 2 methods:
corpus[[1]]$content
## [1] "Wow... Loved this place."
as.character(corpus[[1]])
## [1] "Wow... Loved this place."
corpus[[500]]$content
## [1] "Waitress was sweet and funny."
as.character(corpus[[500]])
## [1] "Waitress was sweet and funny."
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())#from Snowball
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
# Creating the Bag of Words model (sparse matrix)
dtm = DocumentTermMatrix(corpus)
dtm#Note that Sparsity = 100% and 1577 words
## <<DocumentTermMatrix (documents: 1000, terms: 1577)>>
## Non-/sparse entries: 5435/1571565
## Sparsity : 100%
## Maximal term length: 32
## Weighting : term frequency (tf)
dtm = removeSparseTerms(dtm, 0.999)#keep only terms with sparsity <= 99.9%, i.e. words appearing in at least ~0.1% of the reviews
dtm# Sparsity 99% with 691 words
## <<DocumentTermMatrix (documents: 1000, terms: 691)>>
## Non-/sparse entries: 4549/686451
## Sparsity : 99%
## Maximal term length: 12
## Weighting : term frequency (tf)
dataset = as.data.frame(as.matrix(dtm))#Classification Algo requires a data frame
dataset$Liked = dataset_original$Liked
# Below is the same Random Forest Model we used before.
# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Fitting Random Forest Classification to the Training set
# install.packages('randomForest')
library(randomForest)
# Remove the Liked column - number 692
classifier = randomForest(x = training_set[-692], y = training_set$Liked, ntree = 10)
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-692])
y_pred[1:20]
## 4 9 10 16 17 21 24 33 39 40 41 48 56 58 59 61 63 73 76 82
## 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 0 0
## Levels: 0 1
# Making the Confusion Matrix
cm = table(test_set[, 692], y_pred)
cm
## y_pred
## 0 1
## 0 79 21
## 1 30 70
accuracy <- (79+70)/200
accuracy
## [1] 0.745
Not bad accuracy given that the training set was small for an NLP analysis. Naive Bayes, Decision Tree and Random Forest are all often used in NLP.
https://www.r-bloggers.com/build-your-own-neural-network-classifier-in-r/
Deep Learning is the most exciting and powerful branch of Machine Learning. Deep Learning models can be used for a variety of complex tasks:
You will understand and learn how to implement the following Deep Learning models:
The green node above represents a neuron in a hidden layer.
Let’s review the primary activation functions (there are many more).
The one below is most often used:
A single-layer, feed-forward neural network is called a perceptron.
A perceptron will calculate a value. The difference between the predicted value and the actual value we want is called the cost function. There are many ways to calculate the cost function but the most common is 1/2 the squared difference. This is basically the error of the predicted value.
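The calculation above can be sketched in a few lines of R (the inputs, weights and activation here are made-up values, not from the course data):

```r
# One perceptron step: weighted sum, sigmoid activation, 1/2 squared-error cost.
x <- c(0.8, 0.6, 0.4)       # hypothetical input values for one record
w <- c(0.5, -0.2, 0.3)      # hypothetical weights - training adjusts these
y <- 1                      # the actual value we want (e.g. the Exam column)

z    <- sum(w * x)          # weighted sum of the inputs
yhat <- 1 / (1 + exp(-z))   # sigmoid activation gives the predicted value
cost <- 0.5 * (yhat - y)^2  # 1/2 the squared difference
cost                        # ~0.0805
```

Training then nudges w to shrink this cost.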
When the perceptron evaluates the cost function, the neural net will recalculate new weights to minimize the cost function. Back propagation adjusts all of the weights at the same time.
Here is an example:
Note above that the Exam column is the actual value.
After calculating the cost function, the net adjusts the weights of this specific record to reduce the cost function. Only one record is being evaluated. This continues until cost function is minimized. (Ideally yhat = y.)
Let's now evaluate what happens when we have many records.
The neural net processes each row sequentially, calculates yhat, compares it to y, and accumulates the cost function across all the records. The weights are then updated - the same weights apply to every record. The net continues to iterate until the cost function is minimized.
This process is called back propagation.
Gradient Descent is a method to determine the optimal values of the weights in the neural net. If you had only one weight, it would be easy: a brute-force search could try 1000 candidate weights and pick the one with the minimal cost function:
The brute-force approach becomes unwieldy when we have more variables (the curse of dimensionality).
Gradient Descent simply calculates the slope of the line; when the line has a negative slope, it moves to the right. If the line has a positive slope, move to the left.
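That slope-following rule is easy to see on a toy convex cost function (the cost C(w) = (w - 3)^2 and the learning rate below are illustrative, not from the course):

```r
# Gradient descent on C(w) = (w - 3)^2: step against the slope until it flattens.
cost_grad <- function(w) 2 * (w - 3)  # derivative (slope) of the cost
w  <- 0                               # arbitrary starting weight
lr <- 0.1                             # learning rate (step size)
for (i in 1:100) {
  w <- w - lr * cost_grad(w)          # negative slope -> move right; positive -> left
}
w                                     # converges to the minimum at w = 3
```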
Gradient Descent requires the cost function to be convex - to have one global minimum. A non-convex cost function might look like this (a different cost function, or one in multi-dimensional space):
Recall above that every record was evaluated in the neural net, the cost function is calculated and the weights adjusted for the batch.
In stochastic gradient descent, the weights are calculated for each record individually.
Major Differences:
There is another method in between, called Mini-Batch Gradient Descent, where you define the number of records to run in a batch and the algorithm then runs stochastically over those mini-batches.
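The difference between the three variants is simply how many records feed each weight update. A sketch with record indices only (the batch size of 5 is chosen for illustration):

```r
# How the three gradient descent variants group records per weight update.
n <- 10                                            # number of training records
batch_gd   <- list(1:n)                            # batch GD: all records, one update
sgd        <- as.list(1:n)                         # stochastic GD: one update per record
mini_batch <- split(1:n, ceiling(seq_len(n) / 5))  # mini-batch GD: batches of 5
length(batch_gd)    # 1 update per epoch
length(sgd)         # 10 updates per epoch
length(mini_batch)  # 2 updates per epoch
```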
Let's build a classification model using an ANN.
There are many R packages we might consider. Here are a few of them:
Below we will use h2o. It is fast and efficient, offers lots of options (number of layers and number of nodes in each layer), and contains a parameter-tuning feature.
If H2O needs to be updated, go here and restart RStudio: http://h2o-release.s3.amazonaws.com/h2o/rel-ueno/8/index.html
# Importing the dataset
dataset = read.csv('./data/Churn_Modelling.csv')
dataset = dataset[4:14]#Do not need RowNumber, CustomerID or Surname. Note: keep the dependent variable for training.
head(dataset)
## CreditScore Geography Gender Age Tenure Balance NumOfProducts
## 1 619 France Female 42 2 0.00 1
## 2 608 Spain Female 41 1 83807.86 1
## 3 502 France Female 42 8 159660.80 3
## 4 699 France Female 39 1 0.00 2
## 5 850 Spain Female 43 2 125510.82 1
## 6 645 Spain Male 44 8 113755.78 2
## HasCrCard IsActiveMember EstimatedSalary Exited
## 1 1 1 101348.88 1
## 2 0 1 112542.58 0
## 3 1 0 113931.57 1
## 4 0 0 93826.63 0
## 5 1 1 79084.10 0
## 6 1 0 149756.71 1
# Encoding the categorical variables as numeric factors - required by algo
dataset$Geography = as.numeric(factor(dataset$Geography,
levels = c('France', 'Spain', 'Germany'), labels = c(1, 2, 3)))
dataset$Gender = as.numeric(factor(dataset$Gender,
levels = c('Female', 'Male'), labels = c(1, 2)))
# Splitting the dataset into the Training set and Test set - caTools
set.seed(123)
split = sample.split(dataset$Exited, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling - required. Training ANN requires all numeric values
training_set[-11] = scale(training_set[-11])#scales everything except depdendent variable
test_set[-11] = scale(test_set[-11])
# Fitting ANN to the Training set
h2o.init(nthreads = -1)# -1 uses all the available cores in an optimized manner
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\cweaver\AppData\Local\Temp\RtmpczMEHB/h2o_cweaver_started_from_r.out
## C:\Users\cweaver\AppData\Local\Temp\RtmpczMEHB/h2o_cweaver_started_from_r.err
##
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 843 milliseconds
## H2O cluster version: 3.10.4.8
## H2O cluster version age: 22 hours and 7 minutes
## H2O cluster name: H2O_started_from_R_cweaver_iiw313
## H2O cluster total nodes: 1
## H2O cluster total memory: 2.59 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.2.4 (2016-03-10)
h2o.no_progress()
model = h2o.deeplearning(y = 'Exited', training_frame = as.h2o(training_set),
activation = 'Rectifier', hidden = c(6,6), # two hidden layers with 6 neurons each - roughly the average of the number of input and output nodes ((10 + 1)/2)
#this is a simple model - likely 1 hidden layer would be OK
epochs = 100, #how many times to repeat steps to update weights
train_samples_per_iteration = -2) #1 updates after every sample (online learning), larger values give batch learning; -2 auto-tunes
# Predicting the Test set results
h2o.init(nthreads = -1)# -1 uses all the available cores in an optimized manner
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 9 seconds 480 milliseconds
## H2O cluster version: 3.10.4.8
## H2O cluster version age: 22 hours and 8 minutes
## H2O cluster name: H2O_started_from_R_cweaver_iiw313
## H2O cluster total nodes: 1
## H2O cluster total memory: 2.59 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.2.4 (2016-03-10)
h2o.no_progress()
y_pred = h2o.predict(model, newdata = as.h2o(test_set[-11]))#returns an H2OFrame - we need an R vector we can evaluate
y_pred = (y_pred > 0.5)#threshold the probabilities; equivalent to ifelse(y_pred > 0.5, 1, 0)
y_pred = as.vector(y_pred)# convert the H2OFrame to an R vector of 0/1
# Making the Confusion Matrix
cm = table(test_set[, 11], y_pred)
cm
## y_pred
## 0 1
## 0 1533 60
## 1 215 192
(1533+192)/2000
## [1] 0.8625
Not bad - about 86% accurate.
Another H2o example is here:
http://ethen8181.github.io/machine-learning/h2o/h2o_deep_learning/h2o_deep_learning.html (I might copy it below at a later date)
MORE WORK TO DO USING MXNET
Deep Learning is the most exciting and powerful branch of Machine Learning. Deep Learning models can be used for a variety of complex tasks:
You will understand and learn how to implement the following Deep Learning models:
There is much to learn, so let's break this into pieces:
CNNs are becoming even more popular than ANNs, which suggests that image recognition is being used in more and more applications.
The steps we will take to evaluate images are Convolution –> Max Pooling –> Flattening –> Full Connection.
A convolution operation is a function that combines two functions by integrating their product. We will skip the actual equation and get to a practical understanding.
The goals of convolution:
Feature Detector = kernel = filter. The feature detector is commonly 3x3 but can also be 5x5, 7x7, etc. A convolution operation is denoted by an x in a circle. The number of cells you move the detector at each step is called the stride; conventionally a stride of 2 is used.
Feature Map = Convolved Feature = Activation Map. The whole goal is to make the image smaller so that it is easier to process.
Use many different feature maps to create a convolutional layer.
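To make the sliding-detector idea concrete, here is a small hand-rolled convolution in R (a toy 5x5 image and a made-up 3x3 detector, with stride 1 for simplicity):

```r
# Slide the feature detector over the image and sum the elementwise products.
convolve2d <- function(img, kernel) {
  k <- nrow(kernel)
  out <- matrix(0, nrow(img) - k + 1, ncol(img) - k + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(img[i:(i + k - 1), j:(j + k - 1)] * kernel)
  out  # the feature map - smaller than the input image
}
img      <- matrix(rbinom(25, 1, 0.5), 5, 5)      # toy 5x5 binary "image"
detector <- matrix(c(0,1,0, 1,1,1, 0,1,0), 3, 3)  # hypothetical cross-shaped filter
dim(convolve2d(img, detector))                    # 3x3 feature map
```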
ReLU –> Rectified Linear Units
The rectifier is used to increase non-linearity, because images themselves are highly non-linear. (It removes the negative values produced by convolution, breaking up any unwanted linearity.)
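The rectifier itself is just max(0, x) applied to every cell of the feature map:

```r
# ReLU: keep positive values, zero out the negatives.
relu <- function(x) pmax(0, x)
relu(c(-2, -0.5, 0, 1.5))  # -> 0.0 0.0 0.0 1.5
```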
Pooling –> Downsampling
Spatial invariance - features can be a bit different or distorted, the neural network needs flexibility to accommodate.
Below, using max pooling (there are variations) with a stride of 2 (commonly used), record the max value in each 2x2 block:
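The same 2x2 max-pooling step can be reproduced in R (the feature-map values below are made up):

```r
# Max pooling: keep only the largest value in each 2x2 block (stride = 2).
feature_map <- matrix(c(1, 0, 2, 3,
                        4, 6, 6, 8,
                        3, 1, 1, 0,
                        1, 2, 2, 4), nrow = 4, byrow = TRUE)
max_pool <- function(m, size = 2) {
  rows <- seq(1, nrow(m), by = size)
  cols <- seq(1, ncol(m), by = size)
  outer(rows, cols, Vectorize(function(i, j)
    max(m[i:(i + size - 1), j:(j + size - 1)])))
}
max_pool(feature_map)  # the 4x4 feature map shrinks to 2x2
```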
Pooling is really important:
Interesting website: http://scs.ryerson.ca/~aharley/vis/conv/flat.html
Why max pooling? If interested, read this: http://ais.uni-bonn.de/papers/icann2010_maxpool.pdf
We now add a whole ANN to our CNN.
Hidden layers in ANN are called Fully Connected Layers in CNN. The cost function in ANN is called Loss Function in CNN.
In addition to the above remember other things we reviewed:
Learn about other variations on CNNs: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html
Referring to the image below, the neural network does not by itself produce probabilities for Dog and Cat that add nicely to 1 - the raw outputs z1 and z2 do not necessarily sum to 1. The SoftMax function takes care of this.
The softmax function, or normalized exponential function, is a generalization of the logistic function that “squashes” a K-dimensional vector of arbitrary real values into a K-dimensional vector of values in the range [0, 1] that add up to 1.
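That definition is one line of R (the raw scores below are illustrative):

```r
# Softmax: exponentiate and normalize so the outputs sum to 1.
softmax <- function(z) exp(z) / sum(exp(z))
z <- c(dog = 2.0, cat = 0.5)  # raw network outputs - need not sum to 1
p <- softmax(z)
sum(p)                        # exactly 1
```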
Cross-Entropy is used in CNNs as the loss function instead of the MSE Cost function we used in ANNs.
Cross-Entropy is used only for classification.
Why use Cross-Entropy over MSE? Use Cross-Entropy because:
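A quick numeric comparison of the two losses on a single binary prediction (the label and probabilities below are made up) shows the key property: cross-entropy punishes a confident wrong answer far more steeply than MSE does.

```r
# Compare MSE and cross-entropy for a true label y = 1 at two predicted probabilities.
loss_mse <- function(y, yhat) 0.5 * (yhat - y)^2
loss_ce  <- function(y, yhat) -(y * log(yhat) + (1 - y) * log(1 - yhat))

loss_mse(1, 0.9); loss_ce(1, 0.9)    # good prediction: both losses small
loss_mse(1, 0.01); loss_ce(1, 0.01)  # confident wrong prediction: MSE ~0.49, CE ~4.6
```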
In the previous section, we used the powerful H2O package in R to build an ANN for a Business Problem.
If you use H2O to build a CNN for Computer Vision, H2O comes with a framework called Deep Water.
However, this Deep Water framework is still at an early stage, and its final version hasn't been released yet.
https://github.com/h2oai/deepwater
https://www.r-bloggers.com/deep-learning-in-r-2/
There are two types of Dimensionality Reduction techniques:
Feature Selection techniques include Backward Elimination, Forward Selection, Bidirectional Elimination, Score Comparison and more. We covered these in the Regression part.
In this part we will cover the following Feature Extraction techniques:
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.
PCA is a technique used to emphasize variation and bring out strong patterns in a dataset. It’s often used to make data easy to explore and visualize. To bring the concept of PCA to life, take a look at this site: http://setosa.io/ev/principal-component-analysis/
From the m independent variables of your dataset, PCA extracts p <= m new independent variables that explain most of the variance of the dataset, regardless of the dependent variable. That makes PCA an unsupervised model (the dependent variable is never considered).
# Importing the dataset
dataset = read.csv('./data/Wine.csv')
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Customer_Segment, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-14] = scale(training_set[-14])
test_set[-14] = scale(test_set[-14])
# Applying PCA
# install.packages('caret')
library(caret)
pca = preProcess(x = training_set[-14], method = 'pca', pcaComp = 2)#we can plot 2 variables
#FYI - the thresh parameter is a cutoff for the cumulative percent of variance to be retained by PCA
training_set = predict(pca, training_set)#create a new training set with the PCA reduction results
head(training_set)
## Customer_Segment PC1 PC2
## 1 1 -3.249569 1.5661160
## 2 1 -2.165889 -0.3186768
## 3 1 -2.501192 1.2353892
## 6 1 -2.941040 2.2999654
## 7 1 -2.393131 1.3228050
## 9 1 -2.418465 1.0367916
training_set = training_set[c(2, 3, 1)]#reorder the columns so dependent variable is in the last column
head(training_set)
## PC1 PC2 Customer_Segment
## 1 -3.249569 1.5661160 1
## 2 -2.165889 -0.3186768 1
## 3 -2.501192 1.2353892 1
## 6 -2.941040 2.2999654 1
## 7 -2.393131 1.3228050 1
## 9 -2.418465 1.0367916 1
test_set = predict(pca, test_set)
test_set = test_set[c(2, 3, 1)]
# Fitting SVM to the Training set
# install.packages('e1071')
library(e1071)
classifier = svm(formula = Customer_Segment ~ ., data = training_set,
type = 'C-classification', kernel = 'linear')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3])
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
cm
## y_pred
## 1 2 3
## 1 12 0 0
## 2 0 14 0
## 3 0 0 10
# Visualising the Training set results
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('PC1', 'PC2')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'SVM (Training set)', xlab = 'PC1', ylab = 'PC2',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 2, 'deepskyblue', ifelse(y_grid == 1, 'springgreen3', 'tomato')))
points(set, pch = 21, bg = ifelse(set[, 3] == 2, 'blue3', ifelse(set[, 3] == 1, 'green4', 'red3')))
# Visualising the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('PC1', 'PC2')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'SVM (Test set)', xlab = 'PC1', ylab = 'PC2',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 2, 'deepskyblue', ifelse(y_grid == 1, 'springgreen3', 'tomato')))#this has been updated to accomodate 3 classes rather than 2, same line below
points(set, pch = 21, bg = ifelse(set[, 3] == 2, 'blue3', ifelse(set[, 3] == 1, 'green4', 'red3')))
Linear discriminant analysis (LDA) is a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label). Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.
LDA is also closely related to principal component analysis (PCA) and factor analysis in that they both look for linear combinations of variables which best explain the data.[4] LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.
From the n independent variables of your dataset, LDA extracts p <= n new independent variables that best separate the classes of the dependent variable. Because it uses the dependent variable, LDA is a supervised dimensionality-reduction model.
# Importing the dataset
dataset = read.csv('./data/Wine.csv')
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Customer_Segment, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-14] = scale(training_set[-14])
test_set[-14] = scale(test_set[-14])
# Applying LDA
library(MASS)
lda = lda(formula = Customer_Segment ~ ., data = training_set)#with 3 classes, LDA automatically calculates k - 1 = 2 linear discriminants. (Ignore the posterior.X columns; they are simply calculated values used to derive x.LDx)
training_set = as.data.frame(predict(lda, training_set))#LDA returns a matrix, not a dataframe like PCA
training_set = training_set[c(5, 6, 1)]
head(training_set)
## x.LD1 x.LD2 class
## 1 -4.656187 2.081444 1
## 2 -4.336729 1.267238 1
## 3 -3.292202 1.167575 1
## 6 -4.515987 3.265418 1
## 7 -4.627289 3.369602 1
## 9 -3.936469 1.967177 1
test_set = as.data.frame(predict(lda, test_set))
test_set = test_set[c(5, 6, 1)]
# Fitting SVM to the Training set
# install.packages('e1071')
library(e1071)
classifier = svm(formula = class ~ ., data = training_set, type = 'C-classification',
kernel = 'linear')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3])
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
cm#Note PCA got 100% correct. LDA did not.
## y_pred
## 1 2 3
## 1 12 0 0
## 2 1 13 0
## 3 0 0 10
# Visualizing the Training set results
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('x.LD1', 'x.LD2')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
main = 'SVM (Training set)',
xlab = 'LD1', ylab = 'LD2',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 2, 'deepskyblue', ifelse(y_grid == 1, 'springgreen3', 'tomato')))
points(set, pch = 21, bg = ifelse(set[, 3] == 2, 'blue3', ifelse(set[, 3] == 1, 'green4', 'red3')))
# Visualizing the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('x.LD1', 'x.LD2')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'SVM (Test set)',
xlab = 'LD1', ylab = 'LD2',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 2, 'deepskyblue', ifelse(y_grid == 1, 'springgreen3', 'tomato')))
points(set, pch = 21, bg = ifelse(set[, 3] == 2, 'blue3', ifelse(set[, 3] == 1, 'green4', 'red3')))
PCA and LDA feature extraction techniques work on linear data - when the data is linearly separable. Kernel PCA is adapted to non-linear data. KPCA is unsupervised.
The data points below (on the left) are located mostly along a curve in 2D. PCA cannot reduce the dimensionality from two to one, because the points are not located along a straight line. But still, the data are “obviously” located around a one-dimensional non-linear curve. So while PCA fails, there must be another way! And indeed, kernel PCA can find this non-linear manifold and discover that the data are in fact nearly one-dimensional.
It does so by mapping the data into a higher-dimensional space. This can indeed look like a contradiction, but it is not. The data are mapped into a higher-dimensional space, but then turn out to lie on a lower dimensional subspace of it. So you increase the dimensionality in order to be able to decrease it.
The essence of the “kernel trick” is that one does not actually need to explicitly consider the higher-dimensional space, so this potentially confusing leap in dimensionality is performed entirely undercover. The idea, however, stays the same.
But how do we know in the first place whether the data is non-linear when the dataset has more than a few features (the real-world case)? To visualize such data we would need to reduce its dimensionality, and using plain PCA for that could be misleading if the data is in fact non-linear. So how does one decide between kernel PCA and PCA? Answer: if you are doing machine learning, don't rely on visualization - let the model learn it. Include a KPCA step in your inner resampling loop and test the kernels as parameters, including the linear kernel and any others you want (or can afford) to test.
knitr::include_graphics("./images/KernalPCA1.png")
# Importing the dataset
dataset = read.csv('./data/Social_Network_Ads.csv')
dataset = dataset[, 3:5]
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[, 1:2] = scale(training_set[, 1:2])
test_set[, 1:2] = scale(test_set[, 1:2])
Let's perform the same analysis but apply Kernel PCA:
# Applying Kernel PCA
# install.packages('kernlab')
library(kernlab)
kpca = kpca(~., data = training_set[-3], kernel = 'rbfdot', features = 2)
training_set_pca = as.data.frame(predict(kpca, training_set))
head(training_set_pca)
## V1 V2
## 1 -10.069428 -1.4873996
## 3 -8.293841 -1.6232411
## 6 -6.498572 -3.1438574
## 7 -2.933372 -6.4951512
## 8 7.088254 -9.3473575
## 10 -2.317751 -0.3716837
training_set_pca$Purchased = training_set$Purchased#Need to add the dependent variable back for KPCA
test_set_pca = as.data.frame(predict(kpca, test_set))
test_set_pca$Purchased = test_set$Purchased
# Fitting Logistic Regression to the Training set
classifier = glm(formula = Purchased ~ ., family = binomial, data = training_set_pca)
# Predicting the Test set results
prob_pred = predict(classifier, type = 'response', newdata = test_set_pca[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
# Making the Confusion Matrix
cm = table(test_set_pca[, 3], y_pred)
cm#83% accuracy
## y_pred
## 0 1
## 0 57 7
## 1 10 26
#The original logistic regression without KPCA gave this:
cmOLD
## y_pred
## 0 1
## 0 57 7
## 1 10 26
# Visualising the Training set results
#install.packages('ElemStatLearn')
library(ElemStatLearn)
set = training_set_pca
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('V1', 'V2')
prob_set = predict(classifier, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1, 0)
plot(set[, -3],
main = 'Logistic Regression (Training set)',
xlab = 'PC1', ylab = 'PC2',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results
# install.packages('ElemStatLearn')
#library(ElemStatLearn)
set = test_set_pca
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('V1', 'V2')
prob_set = predict(classifier, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1, 0)
plot(set[, -3],
main = 'Logistic Regression (Test set)',
xlab = 'PC1', ylab = 'PC2',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
Since we used a logistic model, we see a linear separation but what really happens is illustrated below:
knitr::include_graphics("./images/KernalPCA2.JPG")
Let's attempt to answer these questions:
To answer these questions, we will explore:
To improve model performance, we use model selection techniques that consist of choosing the best parameters. Every time you build a machine learning model, there are 2 types of parameters:
To evaluate model performance, use K-Fold Cross Validation to reduce the variance among test sets.
We will reuse the kernel SVM model we have already built and apply K-Fold Cross Validation. (This is the model that helps determine whether a user will click on an ad - yes or no.)
# Importing the dataset
dataset = read.csv('./data/Social_Network_Ads.csv')
dataset = dataset[3:5]
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set (caTools)
# install.packages('caTools')
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting Kernel SVM to the Training set (e1071)
classifier = svm(formula = Purchased ~ ., data = training_set,
type = 'C-classification', kernel = 'radial')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3])
# Making the Confusion Matrix
cm = addmargins(table(test_set[, 3], y_pred))
cm
## y_pred
## 0 1 Sum
## 0 57 7 64
## 1 3 33 36
## Sum 60 40 100
cm = prop.table(table(test_set[, 3], y_pred))
cm
## y_pred
## 0 1
## 0 0.57 0.07
## 1 0.03 0.33
# Applying k-Fold Cross Validation (caret)
folds = createFolds(training_set$Purchased, k = 10)#always use the dependent variable
cv = lapply(folds, function(x) {
training_fold = training_set[-x, ]#all the data except for fold x
test_fold = training_set[x, ]
classifier = svm(formula = Purchased ~ ., data = training_fold,
type = 'C-classification', kernel = 'radial')
y_pred = predict(classifier, newdata = test_fold[-3])
cm = table(test_fold[, 3], y_pred)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
cv
## $Fold01
## [1] 0.9032258
##
## $Fold02
## [1] 0.9333333
##
## $Fold03
## [1] 0.9333333
##
## $Fold04
## [1] 1
##
## $Fold05
## [1] 0.9333333
##
## $Fold06
## [1] 0.8666667
##
## $Fold07
## [1] 0.8333333
##
## $Fold08
## [1] 0.9655172
##
## $Fold09
## [1] 0.9310345
##
## $Fold10
## [1] 0.8333333
accuracy = mean(as.numeric(cv))
accuracy#averaging over 10 folds gives much higher confidence than basing a decision on one test set
## [1] 0.9133111
# Visualizing the Training set results
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
main = 'Kernel SVM (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualizing the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'Kernel SVM (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
Grid Search optimizes a model's hyperparameters - the parameters we choose rather than learn from the data.
The caret package is used here because it not only provides just about any ML algorithm you may want, it also simplifies hyperparameter optimization.
Keep in mind that caret is great, but other packages sometimes expose options for the same algorithm that caret does not.
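Before handing things to caret, the idea behind grid search can be shown in a few lines of base R on synthetic data. Here the polynomial degree plays the role of the hyperparameter; every name and value in this sketch is illustrative, not from the dataset used below:

```r
# Manual grid search sketch: try each candidate hyperparameter value,
# score it on a holdout set, keep the best.
set.seed(123)
x = runif(100)
y = sin(2 * pi * x) + rnorm(100, sd = 0.2)
df = data.frame(x, y)
train = 1:70; test = 71:100
degrees = 1:6  # the candidate "grid" of hyperparameter values
rmse = sapply(degrees, function(d) {
  fit = lm(y ~ poly(x, d), data = df[train, ])
  pred = predict(fit, newdata = df[test, ])
  sqrt(mean((df$y[test] - pred)^2))
})
best = degrees[which.min(rmse)]  # the degree with the lowest holdout error
```

caret's train() automates exactly this loop, with resampling instead of a single holdout split.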
# Importing the dataset
dataset = read.csv('./data/Social_Network_Ads.csv')
dataset = dataset[3:5]
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set - caTools
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# Fitting Kernel SVM to the Training set - e1071
classifier = svm(formula = Purchased ~ ., data = training_set,
type = 'C-classification', kernel = 'radial')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-3])
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)
# Applying k-Fold Cross Validation - caret
folds = createFolds(training_set$Purchased, k = 10)
cv = lapply(folds, function(x) {
training_fold = training_set[-x, ]
test_fold = training_set[x, ]
classifier = svm(formula = Purchased ~ ., data = training_fold,
type = 'C-classification', kernel = 'radial')
y_pred = predict(classifier, newdata = test_fold[-3])
cm = table(test_fold[, 3], y_pred)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
accuracy = mean(as.numeric(cv))
Now let's use the caret package to optimize the hyperparameters.
# Applying Grid Search to find the best parameters - caret
# train() resamples via bootstrapping by default - similar in spirit to k-fold CV, so why not always use Grid Search? - RESEARCH THIS
classifier = train(form = Purchased ~ ., data = training_set, method = 'svmRadial')
classifier
## Support Vector Machines with Radial Basis Function Kernel
##
## 300 samples
## 2 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 300, 300, 300, 300, 300, 300, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.9187545 0.8218811
## 0.50 0.9183500 0.8208724
## 1.00 0.9169522 0.8175208
##
## Tuning parameter 'sigma' was held constant at a value of 1.295548
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 1.295548 and C = 0.25.
classifier$bestTune # gives the best values directly
## sigma C
## 1 1.295548 0.25
You could take the values from the Grid Search results and plug them directly into the first classifier we built earlier to get the same result.
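When plugging the tuned values back into e1071's svm(), note the parameter names differ between packages. A sketch of the mapping (assumption: caret's svmRadial is backed by kernlab, whose RBF kernel is exp(-sigma * ||x - y||^2), while e1071's radial kernel is exp(-gamma * ||u - v||^2), so bestTune's sigma maps onto e1071's gamma and C onto cost):

```r
# Both parameterisations multiply the squared Euclidean distance,
# so the kernel value is identical for sigma == gamma.
rbf = function(x, y, g) exp(-g * sum((x - y)^2))
x = c(0.5, -1.2); y = c(1.0, 0.3)
k = rbf(x, y, 1.295548)  # the tuned sigma from bestTune above
k
```

Under this mapping, the earlier call would become svm(..., kernel = 'radial', cost = 0.25, gamma = 1.295548).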
XGBoost is a very high-performance gradient boosting implementation built on decision trees, so feature scaling is not needed. It is great for large datasets - at times even better than neural networks!
Because no feature scaling is applied, you keep the original interpretation of your data set.
# Importing the dataset
dataset = read.csv('./data/Churn_Modelling.csv')
dataset = dataset[4:14]
# Encoding the categorical variables as factors
dataset$Geography = as.numeric(factor(dataset$Geography,
levels = c('France', 'Spain', 'Germany'), labels = c(1, 2, 3)))
dataset$Gender = as.numeric(factor(dataset$Gender,
levels = c('Female', 'Male'), labels = c(1, 2)))
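Integer-coding Geography as 1/2/3 imposes an artificial ordering on the countries. Tree-based models such as XGBoost are fairly robust to this, but one-hot encoding via base R's model.matrix is a common alternative; a minimal sketch (the three-row data frame is illustrative):

```r
# One indicator column per country instead of a single ordered integer;
# "- 1" drops the intercept so all three levels get a column.
df = data.frame(Geography = c('France', 'Spain', 'Germany'))
mm = model.matrix(~ Geography - 1, df)
mm  # 3 x 3 indicator matrix, one 1 per row
```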
# Splitting the dataset into the Training set and Test set - caTools
set.seed(123)
split = sample.split(dataset$Exited, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Fitting XGBoost to the Training set - xgboost
# There are lots of tuning parameters - need to explore them more thoroughly
# Note: with the default (regression) objective the log reports train-rmse;
# for a proper classifier you could set objective = 'binary:logistic'
classifier = xgboost(data = as.matrix(training_set[-11]),
label = training_set$Exited, nrounds = 10)
## [0] train-rmse:0.417359
## [1] train-rmse:0.369616
## [2] train-rmse:0.341594
## [3] train-rmse:0.325560
## [4] train-rmse:0.315234
## [5] train-rmse:0.309214
## [6] train-rmse:0.305675
## [7] train-rmse:0.301696
## [8] train-rmse:0.299027
## [9] train-rmse:0.297434
# nrounds is low since the dataset is pretty small and simple - nrounds = 100 improves train-rmse to 0.195796
# Predicting the Test set results
y_pred = predict(classifier, newdata = as.matrix(test_set[-11]))
y_pred = (y_pred >= 0.5)
# Making the Confusion Matrix
cm = table(test_set[, 11], y_pred)
cm
## y_pred
## FALSE TRUE
## 0 1540 53
## 1 212 195
cm = prop.table(table(test_set[, 11], y_pred))
cm
## y_pred
## FALSE TRUE
## 0 0.7700 0.0265
## 1 0.1060 0.0975
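The cell-by-cell accuracy formula used in the CV loops can be written more compactly as sum(diag(cm)) / sum(cm). A quick check against the confusion-matrix counts printed above:

```r
# Counts from the output above (rows = actual class, cols = predicted)
cm = matrix(c(1540, 212, 53, 195), nrow = 2,
            dimnames = list(actual = c('0', '1'),
                            predicted = c('FALSE', 'TRUE')))
accuracy = sum(diag(cm)) / sum(cm)
accuracy  # (1540 + 195) / 2000 = 0.8675
```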
# Applying k-Fold Cross Validation - caret
folds = createFolds(training_set$Exited, k = 10)
cv = lapply(folds, function(x) {
training_fold = training_set[-x, ]
test_fold = training_set[x, ]
classifier = xgboost(data = as.matrix(training_fold[-11]), label = training_fold$Exited, nrounds = 10) # train on the fold, not the full training set
y_pred = predict(classifier, newdata = as.matrix(test_fold[-11]))#this gives probabilities
y_pred = (y_pred >= 0.5)#convert probabilities to 0/1
cm = table(test_fold[, 11], y_pred)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
## [0] train-rmse:0.417359
## [1] train-rmse:0.369616
## [2] train-rmse:0.341594
## [3] train-rmse:0.325560
## [4] train-rmse:0.315234
## [5] train-rmse:0.309214
## [6] train-rmse:0.305675
## [7] train-rmse:0.301696
## [8] train-rmse:0.299027
## [9] train-rmse:0.297434
## (the same training log is printed for each of the remaining folds)
accuracy = mean(as.numeric(cv))
accuracy
## [1] 0.88375
This beats the accuracy of the ANN!